Reduce the number of buckets in Sidekiq histograms
What does this MR do and why?
For https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15380
Because of the wide range of buckets used for these metrics and the large number of pods running, the cardinality of these series made it hard to query the Prometheus instance serving them.
As a result, some of the metrics that are used for service monitoring and alerting were failing to record in Thanos. By reducing the number of buckets, we hope to improve rule evaluation and prevent missing series for Sidekiq.
This brings the number of series for sidekiq_jobs_completion_seconds and sidekiq_jobs_queue_duration_seconds down from more than 8k to about 1.5k each.
This also reduces the number of buckets used for measuring the total time a job spends per resource: cpu, db, gitaly or elasticsearch.
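As a rough model of why fewer buckets help: each histogram exports one `_bucket` series per `le` boundary, per label combination, per pod. A minimal sketch of that multiplication follows; the function name `estimate_bucket_series` and any inputs passed to it are hypothetical and are not taken from this MR.

```shell
# Rough estimate of the number of `_bucket` series for one histogram.
# All inputs are illustration values supplied by the caller, not measurements.
estimate_bucket_series() {
  buckets="$1"        # number of `le` boundaries, including +Inf
  label_combos="$2"   # distinct label combinations per pod
  pods="$3"           # number of pods exporting the metric
  echo $(( buckets * label_combos * pods ))
}
```

Because the bucket count is a direct multiplier, cutting it shrinks the series count proportionally without changing labels or pod counts.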
How to set up and validate locally
Remove any leftover Prometheus metrics:
→ rm -r tmp/prometheus_multiproc_dir/sidekiq/*
Start the GDK with the sidekiq_exporter configured in gitlab.yml:
sidekiq_exporter:
  enabled: true
  address: localhost
  port: 3807
  log_enabled: true
sidekiq_health_checks:
  enabled: true
  log_enabled: false
  address: localhost
  port: 8082
Curl the metrics server and count the metrics exported, on this branch:
→ curl -s localhost:3807/metrics | grep sidekiq_jobs_completion_seconds_bucket | wc -l
42
Doing the same on master:
→ curl -s localhost:3807/metrics | grep sidekiq_jobs_completion_seconds_bucket | wc -l
154
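The counts above are numbers of exported series lines, so they grow with the number of label combinations present locally. To compare the bucket boundaries themselves, one option is to count distinct `le` values instead. This is a minimal sketch; the helper name `count_le_values` is hypothetical, and it assumes the exporter serves the standard Prometheus text format.

```shell
# Count the distinct `le` bucket boundaries of one histogram,
# reading Prometheus text exposition output from stdin.
count_le_values() {
  metric="$1"
  grep "^${metric}_bucket{" \
    | grep -o 'le="[^"]*"' \
    | sort -u \
    | wc -l \
    | tr -d ' '
}
```

For example: `curl -s localhost:3807/metrics | count_le_values sidekiq_jobs_completion_seconds`. Running it on this branch and on master should show the change in bucket boundaries independently of how many workers have reported metrics.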
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.