Add metrics for Redis Cluster redirection

Sylvester Chin requested to merge sc1-cluster-redirection-metrics into master

What does this MR do and why?

This MR adds a new counter, gitlab_redis_client_redirections_total, which tracks the number of MOVED and ASK redirection errors. redis-rb retries such errors in https://github.com/redis/redis-rb/blob/v4.8.0/lib/redis/cluster.rb#L220, and the retried command usually reaches the correct node. During cluster re-sharding, or possibly with an improper client setup (see the issue below), these redirection errors can lead to a false positive in the service error ratio.
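
As a rough illustration of what counting redirections involves, the sketch below parses a redirection error message into the label set the counter exposes (storage, redirection_type, slot, node_key). It is not the code from this MR: `parse_redirection` and `RedirectionCounter` are hypothetical names, the counter is an in-memory stand-in for a Prometheus counter, and it assumes the error message carries the raw Redis reply in the form `MOVED <slot> <host>:<port>` or `ASK <slot> <host>:<port>`.

```ruby
# Hypothetical sketch, not the implementation in this MR.
# Assumes the error message is the raw Redis reply, e.g. "MOVED 3300 127.0.0.1:7001".
REDIRECTION_PATTERN = /\A(?<type>MOVED|ASK) (?<slot>\d+) (?<node_key>\S+)\z/

# Returns a label hash for a MOVED/ASK message, or nil for any other error.
def parse_redirection(message, storage:)
  match = REDIRECTION_PATTERN.match(message)
  return nil unless match

  {
    storage: storage,
    redirection_type: match[:type],
    slot: match[:slot],
    node_key: match[:node_key]
  }
end

# In-memory stand-in for a Prometheus counter, keyed by label set.
class RedirectionCounter
  def initialize
    @counts = Hash.new(0)
  end

  def increment(labels)
    @counts[labels] += 1
  end

  def get(labels)
    @counts[labels]
  end
end

counter = RedirectionCounter.new
labels = parse_redirection('MOVED 3300 127.0.0.1:7001', storage: 'cluster_rate_limiting')
counter.increment(labels) if labels
```

Errors that are not redirections return nil and are left uncounted, so the counter stays separate from genuine command failures.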

See related issue: gitlab-com/gl-infra/scalability#2212 (closed)

How to set up and validate locally

Setup: Follow steps 1-4 in gitlab-com/gl-infra/scalability#2212 (comment 1287797884)

Option 1: DIY

  1. Open GDK and trigger some MOVED or ASK redirections:
Gitlab::Redis::ClusterRateLimiting.with { |c| c.get('b') } # in this case slot 3300 was being fiddled with, so key `b`, which hashes to that slot, gets redirected
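
To see why key `b` lands on slot 3300, here is a minimal sketch of the Redis Cluster key-to-slot mapping, CRC16 (XMODEM variant) of the key modulo 16384. The helper names `crc16_xmodem` and `key_slot` are made up for illustration, and the sketch ignores hash tags (`{...}`), which real clients handle before hashing.

```ruby
# CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0, no reflection.
def crc16_xmodem(data)
  data.each_byte.reduce(0) do |crc, byte|
    crc ^= byte << 8
    8.times do
      # Shift left one bit; apply the polynomial when the top bit was set.
      crc = (crc & 0x8000).zero? ? (crc << 1) : ((crc << 1) ^ 0x1021)
      crc &= 0xFFFF
    end
    crc
  end
end

# Redis Cluster has 16384 hash slots. Hash tags are NOT handled in this sketch.
def key_slot(key)
  crc16_xmodem(key) % 16_384
end

puts key_slot('b') # prints 3300
```

Running `CLUSTER KEYSLOT b` against a cluster node is the authoritative way to confirm the slot.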

Option 2: Run the specs in this MR

Verify via GDK (first check that metrics are enabled):

[10] pry(main)> Gitlab::Metrics.metrics_folder_present?
=> true
[10] pry(main)> require 'prometheus/client/formats/text.rb'
=> true
[14] pry(main)> puts Prometheus::Client::Formats::Text.marshal_multiprocess.split("\n").filter{|x| x.include?('gitlab_redis_client_redirections_total')}
# HELP gitlab_redis_client_redirections_total Multiprocess metric
# TYPE gitlab_redis_client_redirections_total counter
gitlab_redis_client_redirections_total{node_key="127.0.0.1:6380",redirection_type="ASK",slot="123",storage="shared_state"} 4
gitlab_redis_client_redirections_total{node_key="127.0.0.1:6380",redirection_type="MOVED",slot="123",storage="shared_state"} 5
gitlab_redis_client_redirections_total{node_key="127.0.0.1:7001",redirection_type="ASK",slot="3300",storage="cluster_rate_limiting"} 1
=> nil
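
When there are many label combinations, it can help to total the counter per redirection type. The sketch below does that over the text exposition output shown above. `redirections_by_type` is a hypothetical helper, and the regular expressions assume the simple `name{label="value",...} number` line shape seen here, with no escaped quotes inside label values.

```ruby
SAMPLE = <<~TEXT
  # HELP gitlab_redis_client_redirections_total Multiprocess metric
  # TYPE gitlab_redis_client_redirections_total counter
  gitlab_redis_client_redirections_total{node_key="127.0.0.1:6380",redirection_type="ASK",slot="123",storage="shared_state"} 4
  gitlab_redis_client_redirections_total{node_key="127.0.0.1:6380",redirection_type="MOVED",slot="123",storage="shared_state"} 5
  gitlab_redis_client_redirections_total{node_key="127.0.0.1:7001",redirection_type="ASK",slot="3300",storage="cluster_rate_limiting"} 1
TEXT

# Matches one sample line of the counter and captures its labels and value.
SAMPLE_LINE = /\Agitlab_redis_client_redirections_total\{(?<labels>[^}]*)\}\s+(?<value>\d+(?:\.\d+)?)\z/

# Sums counter values per redirection_type; comment lines and other metrics are skipped.
def redirections_by_type(text)
  text.each_line(chomp: true).each_with_object(Hash.new(0)) do |line, totals|
    match = SAMPLE_LINE.match(line)
    next unless match

    labels = match[:labels].scan(/(\w+)="([^"]*)"/).to_h
    totals[labels['redirection_type']] += match[:value].to_f
  end
end

totals = redirections_by_type(SAMPLE)
totals.each { |type, count| puts "#{type}: #{count}" }
```

In a console, the same helper could be fed the string returned by `Prometheus::Client::Formats::Text.marshal_multiprocess`.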

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Vitali Tatarintev
