vLLM x Model Size Requirements: Iteration I
Custom Models customers frequently request information on the system requirements for self-hosted models, including:
- machine specs
- CPU/GPU requirements
This first iteration of meeting that need is for GitLab to provide baseline system requirements for running supported open-source (OS) models on vLLM. For example, Mixtral 8x22B requires at least 8 GPUs just to run. Giving customers a ballpark of the baseline requirements for the feature to operate at all will let them procure the necessary prerequisites and set up their self-hosted models environment with more confidence. This first iteration would be a one-time effort to enable the self-hosted models GA.
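As a rough, back-of-the-envelope illustration of why larger models need multiple GPUs (the parameter count and per-GPU memory below are approximate assumptions, with fp16 weights and no quantization):

```python
# Rough, weights-only VRAM estimate; it ignores KV cache, activations, and
# CUDA overhead, which is why more GPUs are needed in practice.
PARAMS_BILLION = 141   # Mixtral 8x22B total parameters (approximate)
BYTES_PER_PARAM = 2    # fp16 / bf16, no quantization
GPU_MEMORY_GB = 80     # e.g. one A100 80GB or H100

weights_gb = PARAMS_BILLION * BYTES_PER_PARAM   # billions of params x bytes/param ~= 282 GB
min_gpus = -(-weights_gb // GPU_MEMORY_GB)      # ceiling division -> 4 GPUs for weights alone
print(f"~{weights_gb} GB of weights -> at least {min_gpus} GPUs, "
      "before KV cache and batching headroom")
```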
Proposal
Establish baseline system requirements by:
- Stand up an AI Gateway and GitLab environment; this could be a reference environment (with support from @grantyoung)
- Choose 3-4 representative OS models and set them up in our GCP area (a serving sketch follows this list)
- For example, supported OS models could be divided into small (~2B), medium (~7B), and large (20B+) parameter classes
- Load test harness
- Implement a load test; as a first iteration we can select one of our evals, ramp the number of requests per minute (RPM) as high as possible, and record the results per model (a load-test sketch follows this list)
- Example:
| Model | RPM | Failure rate | GPUs |
|---|---|---|---|
| mistral | 100 | 90% | 1xA100 |
| mistral | 60 | 50% | 1xA100 |
| mistral | 10 | 1% | 1xA100 |
- Document a starting point for baseline self-hosted model functionality
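As a sketch of the model setup step, candidate models could be served through vLLM's OpenAI-compatible server; the model names, tiers, GPU counts, and flags below are illustrative assumptions, not final choices:

```python
import subprocess

# One illustrative candidate per size tier, with a placeholder tensor-parallel
# GPU count; actual models and counts would come from the testing itself.
candidates = {
    "small":  ("google/gemma-2b-it", 1),                      # ~2B
    "medium": ("mistralai/Mistral-7B-Instruct-v0.3", 1),      # ~7B
    "large":  ("mistralai/Mixtral-8x22B-Instruct-v0.1", 8),   # 20B+ (MoE)
}

tier = "medium"
model, tp_size = candidates[tier]

# Launch vLLM's OpenAI-compatible server for the chosen model.
subprocess.run([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", model,
    "--tensor-parallel-size", str(tp_size),
    "--port", "8000",
])
```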
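And as a minimal load-test sketch, assuming a vLLM OpenAI-compatible endpoint (or the AI Gateway in front of it) is reachable at a local URL; the URL, model name, prompt, and RPM values are placeholders for the eval-based workload described above:

```python
import asyncio
import httpx

# Placeholders: endpoint, model, and prompt would be replaced by the chosen
# eval and the model under test.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "mistralai/Mistral-7B-Instruct-v0.3"
PROMPT = "Write a function that reverses a string."

async def send_one(client: httpx.AsyncClient) -> bool:
    """Return True if the request succeeded within the timeout."""
    try:
        resp = await client.post(
            BASE_URL,
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": PROMPT}],
                "max_tokens": 256,
            },
            timeout=60.0,
        )
        return resp.status_code == 200
    except httpx.HTTPError:
        return False

async def run_at_rpm(rpm: int, duration_s: int = 60) -> float:
    """Spread `rpm` requests per minute over the run and return the failure rate."""
    interval = 60.0 / rpm
    async with httpx.AsyncClient() as client:
        tasks = []
        for _ in range(int(rpm * duration_s / 60)):
            tasks.append(asyncio.create_task(send_one(client)))
            await asyncio.sleep(interval)
        results = await asyncio.gather(*tasks)
    return results.count(False) / len(results)

async def main() -> None:
    # Sweep RPM levels and record the failure rate per model, as in the example table.
    for rpm in (10, 60, 100):
        failure_rate = await run_at_rpm(rpm)
        print(f"model={MODEL} rpm={rpm} failure_rate={failure_rate:.0%}")

if __name__ == "__main__":
    asyncio.run(main())
```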
Definition of Done
- We have documented the results of baseline system requirement testing and honed them into recommendations.
- Recommendations have been published in the custom models documentation.
- Customers have a reference point for system requirements when choosing among supported inference platforms and models.