Define a guideline for Review Apps resource requests and limits to avoid overcommitting nodes and improve cluster stability
After watching https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-resource-requests-and-limits, I realized that while resource requests are used by the Kubernetes scheduler to decide which node to schedule a pod on, resource limits are only there for Kubernetes to:
- Throttle pods if the CPU limit is reached
- Terminate (OOM-kill) containers that exceed their memory limit
Given this information and:
- The fact that our nodes' p99 CPU utilization is above 100%
- The fact that we can adjust each pod's resource requests based on their actual p99 utilization (based on )
I think we should adjust:
- Resource requests so that the p99 CPU utilization of every component is between 80% and 100% of its CPU request.
- Resource limits so that the p99 CPU utilization of every component is always below 70% of its CPU limit.
- Resource requests so that the p99 memory utilization of every component is between 80% and 100% of its memory request.
- Resource limits so that the p99 memory utilization of every component is always below 70% of its memory limit.
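The arithmetic behind these four rules can be sketched in a few lines of Python. This is only an illustration, not the method used to fill in the tables: the function name `propose_sizing` and the default 90% target (the middle of the 80% to 100% band) are assumptions, while the 1.5 ratio between limit and request is the one the tables use. With that ratio, a request utilization between 80% and 100% corresponds to a limit utilization between roughly 53% and 67%, which is why both rules can hold at once.

```python
from dataclasses import dataclass

# Ratio between limit and request used in this proposal's tables.
LIMIT_FACTOR = 1.5


@dataclass(frozen=True)
class Sizing:
    request: float              # proposed request, same unit as the input
    limit: float                # proposed limit, same unit as the input
    request_utilization: float  # projected p99 usage / proposed request
    limit_utilization: float    # projected p99 usage / proposed limit


def propose_sizing(current_request: float,
                   p99_request_utilization: float,
                   target_utilization: float = 0.9) -> Sizing:
    """Size a request so the observed p99 usage sits at target_utilization of it.

    p99_request_utilization is a ratio (1.88 means 188% of the current request).
    target_utilization is a hypothetical knob, not a value from the proposal.
    """
    if current_request <= 0 or p99_request_utilization <= 0:
        raise ValueError("request and utilization must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")

    # Estimate absolute p99 usage from the current request and its utilization.
    p99_usage = current_request * p99_request_utilization
    request = p99_usage / target_utilization
    limit = request * LIMIT_FACTOR
    return Sizing(request, limit, p99_usage / request, p99_usage / limit)


if __name__ == "__main__":
    # gitaly CPU: 600m requested, p99 utilization at 188% of the request.
    print(propose_sizing(600, 1.88))
```

The proposed values in the tables are rounded by hand, so they will not match this function's output exactly.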
CPU proposal
Component | Current p99 CPU request utilization | Current p99 CPU limit utilization | Current CPU request | Current CPU limit | Proposed CPU request | Proposed CPU limit |
---|---|---|---|---|---|---|
gitaly | 188% | 93% | 600m | 1200m | 1200m (600m * 2) | 1800m (1200m * 1.5) |
gitlab-shell | 170% | 85% | 125m | 250m | 230m (125m * 1.84) | 345m (230m * 1.5) |
sidekiq | 125% | 85% | 500m | 1000m | 650m (500m * 1.3) | 975m (650m * 1.5) |
unicorn | 95% | 65% | 400m | 800m | 500m (400m * 1.25) | 750m (500m * 1.5) |
gitlab-workhorse | 67% | 34% | 300m | 600m | 250m (300m * 0.83) | 375m (250m * 1.5) |
gitlab-runner | 120% | 60% | 355m | 710m | 450m (355m * 1.26) | 675m (450m * 1.5) |
nginx-ingress/controller | 27% | 15% | 100m | 200m | 100m | 200m |
nginx-ingress/defaultBackend | 20% | 13% | 5m | 10m | 5m | 10m |
postgresql | 95% | 60% | 250m | 500m | 300m (250m * 1.2) | 450m (300m * 1.5) |
redis | 20% | 10% | 100m | 200m | 100m | 200m |
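
As a sanity check, the proposed CPU values can be compared against the guideline with a short Python sketch. It assumes that a component's absolute p99 usage can be estimated as its current request multiplied by the reported p99 request utilization; the reported percentages are rounded, so the results are approximate. The names `CPU_ROWS`, `projected_utilization` and `classify` are made up for this example.

```python
# (component, current request, p99 request utilization in %, proposed request, proposed limit)
# CPU values are in millicores and copied from the CPU proposal table.
CPU_ROWS = [
    ("gitaly", 600, 188, 1200, 1800),
    ("gitlab-shell", 125, 170, 230, 345),
    ("sidekiq", 500, 125, 650, 975),
    ("unicorn", 400, 95, 500, 750),
    ("gitlab-workhorse", 300, 67, 250, 375),
    ("gitlab-runner", 355, 120, 450, 675),
    ("nginx-ingress/controller", 100, 27, 100, 200),
    ("nginx-ingress/defaultBackend", 5, 20, 5, 10),
    ("postgresql", 250, 95, 300, 450),
    ("redis", 100, 20, 100, 200),
]


def projected_utilization(rows):
    """Return {component: (request utilization, limit utilization)} as ratios."""
    result = {}
    for name, current, pct, request, limit in rows:
        # Assumption: absolute p99 usage = current request * reported utilization.
        p99_usage = current * pct / 100
        result[name] = (p99_usage / request, p99_usage / limit)
    return result


def classify(rows, low=0.8, high=1.0, limit_ceiling=0.7):
    """Split components by whether the proposed values meet the guideline."""
    in_band, out_of_band, over_ceiling = [], [], []
    for name, (req_util, lim_util) in projected_utilization(rows).items():
        (in_band if low <= req_util <= high else out_of_band).append(name)
        if lim_util >= limit_ceiling:
            over_ceiling.append(name)
    return in_band, out_of_band, over_ceiling


if __name__ == "__main__":
    for name, (req_util, lim_util) in projected_utilization(CPU_ROWS).items():
        print(f"{name}: request {req_util:.0%}, limit {lim_util:.0%}")
```

Under that assumption, every proposed CPU limit stays below the 70% ceiling. unicorn (76%) and postgresql (about 79%) land just under the 80% floor for requests, and the components whose values are left unchanged (nginx-ingress and redis) remain well below it.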
Memory proposal
Component | Current p99 MEM request utilization | Current p99 MEM limit utilization | Current MEM request | Current MEM limit | Proposed MEM request | Proposed MEM limit |
---|---|---|---|---|---|---|
gitaly | 120% | TBD | 200M | 420M | 240M (200M * 1.2) | 360M (240M * 1.5) |
gitlab-shell | 125% | TBD | 20M | 40M | 25M (20M * 1.25) | 37.5M (25M * 1.5) |
sidekiq | 110% | TBD | 800M | 1.6G | 880M (800M * 1.1) | 1320M (880M * 1.5) |
unicorn | 110% | TBD | 1.4G | 1.8G | 1540M (1400M * 1.1) | 2310M (1540M * 1.5) |
gitlab-workhorse | 30% | TBD | 100M | 200M | 50M (100M * 0.5) | 75M (50M * 1.5) |
gitlab-runner | 12% | TBD | 300M | 600M | 100M (300M * 0.3) | 150M (100M * 1.5) |
nginx-ingress/controller | 180% | TBD | 250M | 500M | 450M (250M * 1.8) | 675M (450M * 1.5) |
nginx-ingress/defaultBackend | 50% | TBD | 12M | 24M | 12M | 24M |
postgresql | 85% | TBD | 256M | ? | 250M | 375M (250M * 1.5) |
redis | 25% | TBD | 60M | 130M | 30M (60M * 0.5) | 45M (30M * 1.5) |
Edited by Rémy Coutable