vLLM x Model Size Requirements: Iteration I
Custom Models customers frequently request information on the system requirements for self-hosted models, including:
- machine specs
- CPU/GPU requirements
This first iteration of meeting that need is for GitLab to provide baseline system requirements for running supported open-source (OS) models on vLLM. For example, Mixtral 8x22B requires at least 8 GPUs just to run. Giving customers a ballpark of the baseline requirements for the feature to operate at all will let them procure the necessary prerequisites and set up their self-hosted models environment with more confidence. This first iteration would be a one-time effort to enable the self-hosted models GA.
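As a rough, back-of-the-envelope illustration of why larger models need multiple GPUs (the parameter count and per-GPU memory below are approximate assumptions, with fp16 weights and no quantization):

```python
# Rough, weights-only VRAM estimate; it ignores KV cache, activations, and
# CUDA overhead, which is why more GPUs are needed in practice.
PARAMS_BILLION = 141   # Mixtral 8x22B total parameters (approximate)
BYTES_PER_PARAM = 2    # fp16 / bf16, no quantization
GPU_MEMORY_GB = 80     # e.g. one A100 80GB or H100

weights_gb = PARAMS_BILLION * BYTES_PER_PARAM   # billions of params x bytes/param ~= 282 GB
min_gpus = -(-weights_gb // GPU_MEMORY_GB)      # ceiling division -> 4 GPUs for weights alone
print(f"~{weights_gb} GB of weights -> at least {min_gpus} GPUs, "
      "before KV cache and batching headroom")
```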
Proposal
Establish baseline system requirements by:
- Stand up an AI Gateway and GitLab environment; this could be a reference environment (with support from @grantyoung)
- Choose 3-4 representative OS models and set them up in our GCP area (a serving sketch follows this list)
- For example, supported OS models could be divided into small (~2B), medium (~7B), and large (20B+) parameter classes
- Load test harness
- Implement a load test; as a first iteration we can select one of our evals, ramp the number of requests per minute (RPM) as high as possible, and record the results per model (a load-test sketch follows this list)
- Example:
| Model | RPM | Failure rate | GPUs |
|---|---|---|---|
| mistral | 100 | 90% | 1xA100 |
| mistral | 60 | 50% | 1xA100 |
| mistral | 10 | 1% | 1xA100 |
- Document a starting point for baseline self-hosted model functionality
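As a sketch of the model setup step, candidate models could be served through vLLM's OpenAI-compatible server; the model names, tiers, GPU counts, and flags below are illustrative assumptions, not final choices:

```python
import subprocess

# One illustrative candidate per size tier, with a placeholder tensor-parallel
# GPU count; actual models and counts would come from the testing itself.
candidates = {
    "small":  ("google/gemma-2b-it", 1),                      # ~2B
    "medium": ("mistralai/Mistral-7B-Instruct-v0.3", 1),      # ~7B
    "large":  ("mistralai/Mixtral-8x22B-Instruct-v0.1", 8),   # 20B+ (MoE)
}

tier = "medium"
model, tp_size = candidates[tier]

# Launch vLLM's OpenAI-compatible server for the chosen model.
subprocess.run([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", model,
    "--tensor-parallel-size", str(tp_size),
    "--port", "8000",
])
```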
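And as a minimal load-test sketch, assuming a vLLM OpenAI-compatible endpoint (or the AI Gateway in front of it) is reachable at a local URL; the URL, model name, prompt, and RPM values are placeholders for the eval-based workload described above:

```python
import asyncio
import httpx

# Placeholders: endpoint, model, and prompt would be replaced by the chosen
# eval and the model under test.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "mistralai/Mistral-7B-Instruct-v0.3"
PROMPT = "Write a function that reverses a string."

async def send_one(client: httpx.AsyncClient) -> bool:
    """Return True if the request succeeded within the timeout."""
    try:
        resp = await client.post(
            BASE_URL,
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": PROMPT}],
                "max_tokens": 256,
            },
            timeout=60.0,
        )
        return resp.status_code == 200
    except httpx.HTTPError:
        return False

async def run_at_rpm(rpm: int, duration_s: int = 60) -> float:
    """Spread `rpm` requests per minute over the run and return the failure rate."""
    interval = 60.0 / rpm
    async with httpx.AsyncClient() as client:
        tasks = []
        for _ in range(int(rpm * duration_s / 60)):
            tasks.append(asyncio.create_task(send_one(client)))
            await asyncio.sleep(interval)
        results = await asyncio.gather(*tasks)
    return results.count(False) / len(results)

async def main() -> None:
    # Sweep RPM levels and record the failure rate per model, as in the example table.
    for rpm in (10, 60, 100):
        failure_rate = await run_at_rpm(rpm)
        print(f"model={MODEL} rpm={rpm} failure_rate={failure_rate:.0%}")

if __name__ == "__main__":
    asyncio.run(main())
```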
Definition of Done
- We have documented the results of baseline system requirement testing and honed them into recommendations.
- Recommendations have been published in the custom models documentation.
- Customers have a reference point for system requirements when choosing among supported inference platforms and models.