Use active sidekiq router's queues for sidekiq/queue_metrics API (!87524)

Quang-Minh Nguyen requested to merge qmnguyen0711/fix-sidekiq-metrics-endpoint into master May 13, 2022

What does this MR do and why?

As discussed in an internal Slack thread, there is a problem with the GET api/v4/sidekiq/queue_metrics endpoint. That endpoint is supposed to return the list of Sidekiq queues with their corresponding backlog and latency numbers. Previously, when we followed the queue-per-worker model, the endpoint returned a curated list of around 500 queues. After we migrated to queue-per-shard and queue routing rules, the number of queues in use dropped significantly, down to a handful. Unfortunately, we still maintain a list of queue names generated from worker names. That list is persisted in Redis and can be accessed through the Sidekiq::Queue API, so the endpoint keeps reporting queues that are no longer used. The redundant queues can only be removed after gitlab-com/gl-infra&596 (closed) is done.

This MR makes the endpoint return data for the active routing queues only. The list of queues is now generated by routing the list of workers through the global Sidekiq router.
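The idea can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not the code in this MR: SketchWorker, SketchRouter, active_queue_names, and queue_metrics are hypothetical names, the worker attributes and rules are made up, and the fallback to "default" is an assumption of the sketch. The stats hash stands in for whatever per-queue numbers Sidekiq reports.

```ruby
# Hypothetical stand-ins; none of these names come from the GitLab codebase.
SketchWorker = Struct.new(:name, :urgency, :resource_boundary, keyword_init: true)

class SketchRouter
  # rules: ordered list of [matcher, queue_name] pairs; the first match wins.
  def initialize(rules)
    @rules = rules
  end

  def route(worker)
    _matcher, queue = @rules.find { |matcher, _queue| matcher.call(worker) }
    queue || 'default' # assumption of this sketch: unmatched workers go to "default"
  end
end

# Route every worker and keep the distinct destination queues.
def active_queue_names(workers, router)
  workers.map { |worker| router.route(worker) }.uniq.sort
end

# Report metrics only for queues that the router can actually target.
def queue_metrics(workers, router, stats)
  active_queue_names(workers, router).to_h do |queue|
    [queue, stats.fetch(queue, { backlog: 0, latency: 0 })]
  end
end

rules = [
  [->(w) { w.urgency == :high && w.resource_boundary == :cpu }, 'urgent_cpu_bound'],
  [->(w) { w.resource_boundary == :memory }, 'memory_bound'],
  [->(_w) { true }, 'default']
]

workers = [
  SketchWorker.new(name: 'admin_emails', urgency: :low, resource_boundary: :unknown),
  SketchWorker.new(name: 'adjourned_project_deletion', urgency: :low, resource_boundary: :memory),
  SketchWorker.new(name: 'analytics_code_review_metrics', urgency: :high, resource_boundary: :cpu)
]

# "admin_emails" is a stale per-worker queue that still exists in the stats.
stats = {
  'admin_emails' => { backlog: 0, latency: 0 },
  'default' => { backlog: 2, latency: 5 }
}

puts queue_metrics(workers, SketchRouter.new(rules), stats).keys.inspect
```

The stale per-worker queue is dropped from the result because no worker is routed to it, which is the behavior the "After" output in the validation steps shows.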

How to set up and validate locally

1. Apply the production routing rules to the local environment.
2. Run curl --header "PRIVATE-TOKEN: $TOKEN" "http://localhost:3000/api/v4/sidekiq/queue_metrics" against the local web server. The results differ before and after the change.
Before

{
  "queues": {
    "adjourned_project_deletion": {
      "backlog": 0,
      "latency": 0
    },
    "admin_emails": {
      "backlog": 0,
      "latency": 0
    },
    "analytics_code_review_metrics": {
      "backlog": 0,
      "latency": 0
    },
    "analytics_devops_adoption_create_snapshot": {
      "backlog": 0,
      "latency": 0
    },
    "analytics_usage_trends_counter_job": {
      "backlog": 0,
      "latency": 0
    },
    ... 500+ more
  }
}

After

{
  "queues": {
    "database_throttled": {
      "backlog": 0,
      "latency": 0
    },
    "default": {
      "backlog": 0,
      "latency": 0
    },
    "elasticsearch": {
      "backlog": 177,
      "latency": 2070042
    },
    "email_receiver": {
      "backlog": 0,
      "latency": 0
    },
    "gitaly_throttled": {
      "backlog": 0,
      "latency": 0
    },
    "imports": {
      "backlog": 0,
      "latency": 0
    },
    "low_urgency_cpu_bound": {
      "backlog": 108,
      "latency": 2070042
    },
    "mailers": {
      "backlog": 0,
      "latency": 0
    },
    "memory_bound": {
      "backlog": 0,
      "latency": 0
    },
    "quarantine": {
      "backlog": 0,
      "latency": 0
    },
    "service_desk_email_receiver": {
      "backlog": 0,
      "latency": 0
    },
    "urgent_cpu_bound": {
      "backlog": 0,
      "latency": 0
    },
    "urgent_other": {
      "backlog": 3,
      "latency": 1564815
    }
  }
}
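When comparing the two responses, it can help to reduce them to the queues that actually hold jobs. The following is a minimal sketch, assuming the response body has the shape shown above; busy_queues is a hypothetical helper name, and the sample body is a trimmed copy of the "After" output.

```ruby
require 'json'

# Returns [name, backlog, latency] triples for queues with a non-empty
# backlog, largest backlog first.
def busy_queues(body)
  queues = JSON.parse(body).fetch('queues')
  queues
    .select { |_name, metrics| metrics['backlog'].positive? }
    .sort_by { |_name, metrics| -metrics['backlog'] }
    .map { |name, metrics| [name, metrics['backlog'], metrics['latency']] }
end

sample = <<~JSON
  {
    "queues": {
      "default": { "backlog": 0, "latency": 0 },
      "elasticsearch": { "backlog": 177, "latency": 2070042 },
      "low_urgency_cpu_bound": { "backlog": 108, "latency": 2070042 },
      "urgent_other": { "backlog": 3, "latency": 1564815 }
    }
  }
JSON

busy_queues(sample).each do |name, backlog, latency|
  puts "#{name}: backlog=#{backlog} latency=#{latency}"
end
```

To use it against a running instance, the output of the curl command from the validation steps can be saved to a file and passed in with File.read.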

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited May 13, 2022 by Quang-Minh Nguyen
