Adds Wal Receiver Saturation indicator (!150544) · Merge requests · GitLab.org / GitLab

Leonardo da Rosa requested to merge 421694-iterate-on-wal-rate-db-health-check-indicator into master Apr 23, 2024

What does this MR do and why?

It adds a new health indicator for batched background migrations (BBM), based on the WAL receiver saturation metric.

Query

# main
max(1 - quantile_over_time(0.50, postgres_replication_process_state_ratio{env="gprd", type="patroni", process_type="walreceiver", process_state="S"}[5m]))

=> [{"metric"=>{}, "value"=>[1714755522.823, "0.7833"]}]

# ci
max(1 - quantile_over_time(0.50, postgres_replication_process_state_ratio{env="gprd", type="patroni-ci", process_type="walreceiver", process_state="S"}[5m]))

=> [{"metric"=>{}, "value"=>[1714676243.81, "0.7541625"]}]
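To illustrate how a result like the ones above could be read, here is a minimal Ruby sketch. It assumes the response is an array of hashes with a `"value"` pair of `[timestamp, string]`, as in the sample output. The helper name `extract_sli` and the constant `SLO` are hypothetical; the 0.7 threshold is the one used in the validation scenarios of this description.

```ruby
# Hypothetical helper (not part of GitLab): pulls the scalar out of a
# Prometheus instant-query result shaped like the sample output above.
def extract_sli(result)
  sample = result.first
  return nil if sample.nil?

  # "value" is a [timestamp, string] pair; the number arrives as a string.
  Float(sample.fetch("value").last)
end

# Same threshold as the wal_receiver_saturation_slo used in the scenarios.
SLO = 0.7

main_result = [{ "metric" => {}, "value" => [1714755522.823, "0.7833"] }]
ci_result   = [{ "metric" => {}, "value" => [1714676243.81, "0.7541625"] }]

main_sli = extract_sli(main_result)
ci_sli   = extract_sli(ci_result)

puts "main: #{main_sli} above SLO? #{main_sli > SLO}"
puts "ci:   #{ci_sli} above SLO? #{ci_sli > SLO}"
```

Both sample values are above 0.7, which is the situation exercised in Scenario 3.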

CR issues

How to set up and validate locally

Prerequisite: Thanos cannot be accessed from a local machine, so the PromQL result has to be mocked locally.
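As an alternative to editing source files, one possible way to mock the value is to override the fetching method on a single object in the console. The sketch below uses a stand-in class, `FakeSaturationIndicator`, which is hypothetical and does not mirror GitLab's implementation. Whether the real indicator can be stubbed the same way depends on how it calls `fetch_sli` internally, which is an assumption here.

```ruby
# Stand-in for the real indicator, for illustration only; the class and its
# internals are hypothetical and do not mirror GitLab's implementation.
class FakeSaturationIndicator
  SLO = 0.7

  def evaluate
    sli = fetch_sli
    return :unknown if sli.nil?

    sli > SLO ? :stop : :normal
  end

  private

  # Would query Thanos in a real setup, which is unreachable locally.
  def fetch_sli
    raise NotImplementedError, "Thanos is not reachable from a local machine"
  end
end

indicator = FakeSaturationIndicator.new

# Override fetch_sli on this one object only; the class stays untouched.
indicator.define_singleton_method(:fetch_sli) { 0.7814833333333333 }
puts indicator.evaluate # value above the SLO

indicator.define_singleton_method(:fetch_sli) { 0.68148 }
puts indicator.evaluate # value below the SLO
```

Because the override lives on one object, no `reload!` or source revert is needed afterwards.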

Scenario 1: Signals::NotAvailable without the required feature flag

Feature.enabled?(:db_health_check_wal_receiver_saturation, type: :ops)
=> false

context = OpenStruct.new(gitlab_schema: :gitlab_main)
indicator = Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation.new(context)
indicator.evaluate

#<Gitlab::Database::HealthStatus::Signals::NotAvailable:0x000000017907f150 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation, @reason="indicator disabled">

Scenario 2: Signals::Unknown on empty Prometheus alert settings

Feature.enable(:db_health_check_wal_receiver_saturation)

application_setting = ApplicationSetting.last
application_setting.update(prometheus_alert_db_indicators_settings: nil)

indicator.evaluate

#<Gitlab::Database::HealthStatus::Signals::Unknown:0x0000000179e90cd0 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation, @reason="Prometheus Settings not configured">

Scenario 3: Signals::Stop on WAL receiver saturation condition not being met

ApplicationSetting.last.update(
  prometheus_alert_db_indicators_settings: {
    prometheus_api_url: '',
    wal_receiver_saturation_sli_query: {
      main_cell: 'max(1 - quantile_over_time(0.50, postgres_replication_process_state_ratio{env="gprd", type="patroni", process_type="walreceiver", process_state="S"}[5m]))',
      main: 'max(1 - quantile_over_time(0.50, postgres_replication_process_state_ratio{env="gprd", type="patroni", process_type="walreceiver", process_state="S"}[5m]))',
      ci: 'max(1 - quantile_over_time(0.50, postgres_replication_process_state_ratio{env="gprd", type="patroni-ci", process_type="walreceiver", process_state="S"}[5m]))'
    },
    wal_receiver_saturation_slo: {
      main: 0.7,
      ci: 0.7
    }
  }
)

# Manually change Gitlab::PrometheusClient.ready? to `return true`
# Manually change Indicators::PrometheusAlertIndicator.fetch_sli to return a value above 0.7, e.g. 0.7814833333333333

reload!

indicator = Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation.new(context)
indicator.evaluate

#<Gitlab::Database::HealthStatus::Signals::Stop:0x0000000178fff108 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation, @reason="WalReceiverSaturation SLI condition not met">

Scenario 4: Signals::Normal on WAL receiver condition being met

# Manually change Indicators::PrometheusAlertIndicator.fetch_sli to return a value below 0.7, e.g. 0.68148

reload!

indicator = Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation.new(context)
indicator.evaluate

#<Gitlab::Database::HealthStatus::Signals::Normal:0x000000017997d7c0 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation, @reason="WalReceiverSaturation SLI condition met">

Scenario 5: Signals::Unknown when the WAL receiver condition cannot be calculated

# Manually change Indicators::PrometheusAlertIndicator.fetch_sli to return nil

reload!

indicator = Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation.new(context)
indicator.evaluate

#<Gitlab::Database::HealthStatus::Signals::Normal:0x000000017997d7c0 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation, @reason="WalReceiverSaturation SLI condition met">
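The five scenario headings can be condensed into one decision function. This is a sketch of the behavior described in this MR, not the actual indicator code: the method name `expected_signal`, its keyword arguments, and the symbol return values are all hypothetical. The scenarios do not cover an SLI exactly equal to the SLO, so treating that case as normal is an assumption.

```ruby
# Hypothetical summary of the scenarios; names and argument shapes are
# illustrative and not taken from the GitLab codebase.
def expected_signal(flag_enabled:, settings:, sli:, slo: 0.7)
  return :not_available unless flag_enabled            # Scenario 1
  return :unknown if settings.nil? || settings.empty?  # Scenario 2
  return :unknown if sli.nil?                          # Scenario 5

  # Scenarios 3 and 4; equality is treated as normal (assumption).
  sli > slo ? :stop : :normal
end

settings = { wal_receiver_saturation_slo: { main: 0.7, ci: 0.7 } }

puts expected_signal(flag_enabled: false, settings: settings, sli: 0.78)
puts expected_signal(flag_enabled: true, settings: nil, sli: 0.78)
puts expected_signal(flag_enabled: true, settings: settings, sli: 0.7814833333333333)
puts expected_signal(flag_enabled: true, settings: settings, sli: 0.68148)
puts expected_signal(flag_enabled: true, settings: settings, sli: nil)
```

The order of the guard clauses follows the order of the scenarios.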

Related to #421694 (closed)

Edited May 03, 2024 by Leonardo da Rosa
