Reduce the number of buckets in Sidekiq histograms (!82509) · Merge requests · GitLab.org / GitLab

Bob Van Landuyt requested to merge bvl-reduce-buckets-for-sidekiq-histograms into master Mar 09, 2022

What does this MR do and why?

For https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15380

Because of the wide range of buckets used for these metrics and the large number of pods running, the cardinality of these series made it hard to query the Prometheus instance serving them.

As a result, some of the metrics used for service monitoring and alerting were failing to record in Thanos. By reducing the number of buckets, we hope to improve rule evaluation and prevent missing series for Sidekiq.

This brings the number of series for sidekiq_jobs_completion_seconds and sidekiq_jobs_queue_duration_seconds down from more than 8k to about 1.5k each.

This also reduces the number of buckets used for measuring the total time a job spends per resource: cpu, db, gitaly or elasticsearch.
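The effect of the bucket count on cardinality can be sketched with some arithmetic: a histogram exports one `_bucket` series per bucket boundary (plus the `+Inf` bucket) for every label combination on every pod. The helper below is a rough illustration, not code from this MR. `bucket_series` is a hypothetical name, and the inputs (10 and 2 finite boundaries, 14 label combinations) are assumptions chosen only because they reproduce the local counts in the validation steps; this description does not state the actual bucket lists.

```shell
# Estimate the number of `_bucket` series one histogram exports.
# $1: number of finite bucket boundaries (the +Inf bucket is added here)
# $2: number of distinct label combinations (assumed input)
# $3: number of pods exporting the metric (defaults to 1)
bucket_series() {
  boundaries="$1"
  label_sets="$2"
  pods="${3:-1}"
  echo $(( (boundaries + 1) * label_sets * pods ))
}

# Assumed values, for illustration only.
bucket_series 10 14   # prints 154
bucket_series 2 14    # prints 42
```

Because the total is a product, removing boundaries shrinks the series count on every pod and for every label combination at once.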

How to set up and validate locally

Remove any leftover Prometheus metrics:

→ rm -r tmp/prometheus_multiproc_dir/sidekiq/*

Start the GDK with the sidekiq_exporter configured in gitlab.yml:

    sidekiq_exporter:
      enabled: true
      address: localhost
      port: 3807
      log_enabled: true
    sidekiq_health_checks:
      enabled: true
      log_enabled: false
      address: localhost
      port: 8082

Curl the metrics server and count the metrics exported, on this branch:

→ curl -s localhost:3807/metrics | grep sidekiq_jobs_completion_seconds_bucket | wc -l
42

Doing the same on master:

→ curl -s localhost:3807/metrics | grep sidekiq_jobs_completion_seconds_bucket | wc -l
154
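The line counts above show how many series are exported, but not which bucket boundaries remain. A minimal sketch for listing the distinct `le` labels of one histogram follows, assuming the exporter emits the standard Prometheus text format. `bucket_boundaries` is a hypothetical helper name, not part of GitLab or the GDK.

```shell
# Print the distinct bucket boundaries (the `le` label) of one histogram.
# Reads Prometheus text exposition on stdin; $1 is the metric base name.
bucket_boundaries() {
  metric="$1"
  grep "^${metric}_bucket{" | grep -o 'le="[^"]*"' | sort -u
}

# Usage (with the exporter running as configured above):
#   curl -s localhost:3807/metrics | bucket_boundaries sidekiq_jobs_completion_seconds
```

Running this on both branches and comparing the output shows which boundaries were dropped, independent of how many label combinations happen to be present locally.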

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Mar 09, 2022 by Bob Van Landuyt
