Adds WAL Receiver Saturation indicator

What does this MR do and why?

It adds a new health indicator for batched background migrations (BBM), based on the WAL receiver saturation metric.

Query

# main
max(1 - quantile_over_time(0.50, postgres_replication_process_state_ratio{env="gprd", type="patroni", process_type="walreceiver", process_state="S"}[5m]))

=> [{"metric"=>{}, "value"=>[1714755522.823, "0.7833"]}]

# ci
max(1 - quantile_over_time(0.50, postgres_replication_process_state_ratio{env="gprd", type="patroni-ci", process_type="walreceiver", process_state="S"}[5m]))

=> [{"metric"=>{}, "value"=>[1714676243.81, "0.7541625"]}]
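The arithmetic behind both queries can be sketched in plain Ruby. This is only an illustration: it assumes that process_state="S" marks the share of time the WAL receiver spends sleeping, so that 1 minus that share reads as saturation. The helper names and sample values are hypothetical, not production data.

```ruby
# Median of a sample window; stands in for quantile_over_time(0.50, ...[5m]).
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Per replica: saturation = 1 - median ratio of time in state "S".
# The outer max(...) keeps the most saturated replica.
def wal_receiver_saturation(state_ratios_by_replica)
  state_ratios_by_replica.values.map { |window| 1 - median(window) }.max
end

# Hypothetical samples for two replicas over one window.
samples = {
  'replica-a' => [0.30, 0.25, 0.20],
  'replica-b' => [0.60, 0.55, 0.50]
}

saturation = wal_receiver_saturation(samples)
puts saturation.round(2) # prints 0.75
```

Here replica-a has a median ratio of 0.25 and therefore a saturation of 0.75, which is the value the max picks.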

CR issues

How to set up and validate locally

Prerequisite: Thanos cannot be accessed from a local machine, so the PromQL result has to be mocked locally.
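One possible way to mock the result without editing source files is to prepend a stub module in the console. The sketch below shows the technique on a stand-in class. StandInIndicator, StubbedSli, and the symbols returned are hypothetical; the real fetch_sli signature and visibility are assumptions and may differ.

```ruby
# Hypothetical stand-in for the indicator; not GitLab code.
class StandInIndicator
  def initialize(slo)
    @slo = slo
  end

  def evaluate
    sli = fetch_sli
    return :unknown if sli.nil?

    sli <= @slo ? :normal : :stop
  end

  private

  # Stands in for the call that would reach Prometheus/Thanos.
  def fetch_sli
    raise 'Prometheus is not reachable from a local machine'
  end
end

# Prepended module: its fetch_sli is found before the class's own method.
module StubbedSli
  class << self
    attr_accessor :value
  end

  private

  def fetch_sli
    StubbedSli.value
  end
end

StandInIndicator.prepend(StubbedSli)

indicator = StandInIndicator.new(0.7)

StubbedSli.value = 0.7814833333333333
p indicator.evaluate # prints :stop

StubbedSli.value = 0.68148
p indicator.evaluate # prints :normal

StubbedSli.value = nil
p indicator.evaluate # prints :unknown
```

With this approach the stubbed value can be changed between evaluations in the same console session. The override would most likely be lost once the class is reloaded.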

Scenario 1: Signals::NotAvailable without the required feature flag

Feature.enabled?(:db_health_check_wal_receiver_saturation, type: :ops)
=> false

context = OpenStruct.new(gitlab_schema: :gitlab_main)
indicator = Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation.new(context)
indicator.evaluate

#<Gitlab::Database::HealthStatus::Signals::NotAvailable:0x000000017907f150 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation, @reason="indicator disabled">

Scenario 2: Signals::Unknown on empty Prometheus alert settings

Feature.enable(:db_health_check_wal_receiver_saturation)

application_setting = ApplicationSetting.last
application_setting.update(prometheus_alert_db_indicators_settings: nil)

indicator.evaluate

#<Gitlab::Database::HealthStatus::Signals::Unknown:0x0000000179e90cd0 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation, @reason="Prometheus Settings not configured">

Scenario 3: Signals::Stop on WAL receiver saturation condition not being met

ApplicationSetting.last.update(
  prometheus_alert_db_indicators_settings: {
    prometheus_api_url: '',
    wal_receiver_saturation_sli_query: {
      main_cell: 'max(1 - quantile_over_time(0.50, postgres_replication_process_state_ratio{env="gprd", type="patroni", process_type="walreceiver", process_state="S"}[5m]))',
      main: 'max(1 - quantile_over_time(0.50, postgres_replication_process_state_ratio{env="gprd", type="patroni", process_type="walreceiver", process_state="S"}[5m]))',
      ci: 'max(1 - quantile_over_time(0.50, postgres_replication_process_state_ratio{env="gprd", type="patroni-ci", process_type="walreceiver", process_state="S"}[5m]))'
    },
    wal_receiver_saturation_slo: {
      main: 0.7,
      ci: 0.7
    }
  }
)

# Manually change Gitlab::PrometheusClient.ready? to `return true`
# Manually change Indicators::PrometheusAlertIndicator.fetch_sli to return a value above the 0.7 SLO, e.g. 0.7814833333333333

reload!

indicator = Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation.new(context)
indicator.evaluate

#<Gitlab::Database::HealthStatus::Signals::Stop:0x0000000178fff108 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation, @reason="WalReceiverSaturation SLI condition not met">

Scenario 4: Signals::Normal on WAL receiver condition being met

# Manually change Indicators::PrometheusAlertIndicator.fetch_sli to return a value below the 0.7 SLO, e.g. 0.68148

reload!

indicator = Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation.new(context)
indicator.evaluate

#<Gitlab::Database::HealthStatus::Signals::Normal:0x000000017997d7c0 @indicator_class=Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation, @reason="WalReceiverSaturation SLI condition met">

Scenario 5: Signals::Unknown when the WAL receiver condition cannot be calculated

# Manually change Indicators::PrometheusAlertIndicator.fetch_sli to return nil

reload!

indicator = Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation.new(context)
indicator.evaluate

# Expected: a Gitlab::Database::HealthStatus::Signals::Unknown signal for Gitlab::Database::HealthStatus::Indicators::WalReceiverSaturation
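The five scenarios above can be read as one decision sequence. The following is a rough sketch of that sequence, not the actual indicator implementation. Signal and evaluate_indicator are hypothetical names, the reason text for a nil SLI is invented for illustration, and treating an SLI exactly equal to the SLO as "met" is an assumption.

```ruby
# Hypothetical value object standing in for the Signals::* classes.
Signal = Struct.new(:kind, :reason)

def evaluate_indicator(flag_enabled:, settings:, sli:, schema: :main)
  # Scenario 1: feature flag off.
  return Signal.new(:not_available, 'indicator disabled') unless flag_enabled

  # Scenario 2: no Prometheus alert settings stored.
  return Signal.new(:unknown, 'Prometheus Settings not configured') if settings.nil?

  # Scenario 5: SLI could not be fetched (reason text is hypothetical).
  return Signal.new(:unknown, 'SLI could not be calculated') if sli.nil?

  slo = settings.fetch(:wal_receiver_saturation_slo).fetch(schema)

  if sli <= slo # assumption: equality counts as condition met
    # Scenario 4
    Signal.new(:normal, 'WalReceiverSaturation SLI condition met')
  else
    # Scenario 3
    Signal.new(:stop, 'WalReceiverSaturation SLI condition not met')
  end
end

settings = { wal_receiver_saturation_slo: { main: 0.7, ci: 0.7 } }

p evaluate_indicator(flag_enabled: false, settings: settings, sli: 0.5).kind # prints :not_available
p evaluate_indicator(flag_enabled: true, settings: nil, sli: 0.5).kind       # prints :unknown
p evaluate_indicator(flag_enabled: true, settings: settings, sli: 0.7814833333333333).kind # prints :stop
p evaluate_indicator(flag_enabled: true, settings: settings, sli: 0.68148).kind            # prints :normal
p evaluate_indicator(flag_enabled: true, settings: settings, sli: nil).kind  # prints :unknown
```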

Related to #421694 (closed)

Edited by Leonardo da Rosa
