Consolidate and improve Gitaly cgroups documentation
Context
Documentation covering how Gitaly cgroups should be configured is currently spread across several locations, including:
- https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/gitaly/gitaly-repos-cgroup.md
- https://docs.gitlab.com/ee/administration/gitaly/configure_gitaly.html#control-groups
- https://docs.gitlab.com/ee/administration/gitaly/monitoring.html#monitor-gitaly-cgroups
- https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/cgroups.md
- !4461 (comment 905883341)
- #3049 (closed)
- https://docs.gitlab.com/ee/administration/gitaly/kubernetes.html#constrain-git-processes-resource-usage
Cgroups should be enabled on Gitaly nodes under two circumstances:
- When a Gitaly node is serving heavy, non-uniformly distributed workloads.
- When Gitaly is running on Kubernetes, where cgroups also act as a protection mechanism against pod eviction, in addition to the reason above.
The current cgroups documentation, particularly the control groups section in the Configure Gitaly document, acts as a reference rather than a tutorial. Since cgroups tuning can be quite complex, we should offer additional guidance in line with best practices and empirical observations from running our own infrastructure.
Proposal
Review the documents in the list above, and compile key points into a central document. The resulting document should follow a tutorial style that guides the administrator on how to:
- Measure the current baseline Git workload demands
- Configure appropriate cgroups parent and repository limits
- Monitor cgroups-related metrics to determine if additional tuning is required
We should draw on our own experience tuning cgroups values for gitlab.com.