Geo: Make metrics collection robust
Problem
Geo::MetricsUpdateWorker is inherently unreliable in its current form. Its design is a root cause of many support tickets, especially for deployments with large amounts of data.
Every time a Geo count query fails, it blocks all Geo metrics collection and observability (example).
In this case, another team modified a relation, which led to a batch-count-related query timing out for a customer (I think they have tens of millions of job artifact rows). It is predictable that this class of problem will recur from time to time, so the metrics collection design should account for it.
Org Mover
Org Mover's needs are similar to Geo's:
- We add an Org to be moved from Cell X to Y
- Org Mover metrics collection starts running queries in Cell X and Y
- A query starts timing out in Cell X or Y
- All Org Mover metrics collection stops for that Cell
- The fix is not trivial, and takes 4 weeks to get released
- During those 4 weeks, work is slowed down on resolving Org Mover sync/verification failures specific to that Org
Proposal MVC - Rescue and set a fallback for each metric
This is the most minimal thing we can do to prevent one timing-out metric from impacting all others.
Weight 4 mostly for writing tests.
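The rescue-and-fallback idea can be sketched as follows. This is a minimal illustration, not the actual worker code: the method name, metric names, and sentinel value are assumptions, and the real implementation would iterate over Geo registry count queries.

```ruby
# Sketch of "rescue and set a fallback for each metric".
# FALLBACK_VALUE is an assumed sentinel meaning "this metric failed to collect";
# the real worker would pick whatever value the dashboards can interpret.
FALLBACK_VALUE = -1

def collect_metrics(metric_queries)
  metric_queries.each_with_object({}) do |(name, query), result|
    result[name] =
      begin
        query.call
      rescue StandardError => e
        # A single slow or timing-out query no longer blocks all other metrics.
        warn "Geo metric #{name} failed: #{e.class}: #{e.message}"
        FALLBACK_VALUE
      end
  end
end

metrics = {
  repositories_synced_count: -> { 42 },                       # succeeds
  job_artifacts_count: -> { raise "canceling statement due to statement timeout" }
}
collect_metrics(metrics)
# The failing metric gets FALLBACK_VALUE; the succeeding one is unaffected.
```

The key design choice is rescuing per metric rather than around the whole collection loop, so one bad query degrades exactly one data point.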
Proposal Redesign - Adapt Service Ping code
The nature of the problem to solve is almost exactly the same as old Service Ping. We need to collect a wide variety of metrics, sliced in many different ways.
We already reuse BatchCounter, which was developed for old Service Ping. We should borrow more techniques from Service Ping, which has already solved many of the same problems; we don't need to reinvent the wheel.
Unfortunately, due to performance problems, the Service Ping code we would like to reuse is deprecated. The Analytics Instrumentation group is pushing all metrics collection toward event tracking. This is inherently efficient, but it's best suited to tracking deltas and is less reliable for total counts.
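To illustrate why delta tracking is less reliable for totals, here is a contrived sketch (not GitLab code): any write that bypasses the instrumented path is never counted, so an event-based counter drifts, while a direct count stays authoritative at the cost of a potentially heavy query.

```ruby
# Contrived illustration of event-based deltas vs. authoritative totals.
class ArtifactStore
  attr_reader :rows, :event_counter

  def initialize
    @rows = []
    @event_counter = 0
  end

  def create_instrumented(row)
    @rows << row
    @event_counter += 1 # delta tracked at write time
  end

  def create_uninstrumented(row)
    @rows << row # e.g. a bulk import or data migration that skips tracking
  end

  def total_count
    @rows.size # authoritative, but a heavy query at tens of millions of rows
  end
end

store = ArtifactStore.new
store.create_instrumented(:a)
store.create_uninstrumented(:b)
store.event_counter # => 1, has drifted
store.total_count   # => 2, correct
```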
More details
Some similar challenges:
- There are many potentially heavy metrics to collect.
- Any particularly slow or timing-out metric should not block the collection of other metrics.
- The act of collecting metrics should add as little system load as possible.
However, one crucial difference is that Service Ping only needs to send a weekly payload, whereas we rely on Geo::MetricsUpdateWorker to provide data as close to real time as possible for the Admin > Geo > Sites dashboard, as well as for Prometheus/Grafana dashboards and alerting. Therefore we may have to walk a much finer line to balance all needs.
Implementation
- Copy and adapt the Service Ping architecture into Geo for metrics collection. Store metrics in Redis for Geo::MetricsUpdateWorker to collect later?
- Duplicate a Geo metric into the new architecture.
- Make Geo::MetricsUpdateWorker collect the duplicated metric rather than calculating it directly.
- Ensure backward compatibility so we don't have to wait for 18.0 or force customers to make a big transition.
  - Admin > Geo > Sites dashboard must continue to work.
  - The primary site must continue to export metrics gathered from all secondary sites as well.
  - The primary site can export new metrics, but it must continue to export existing metrics under the same name.
- Migrate all the rest of the metrics to the new architecture.
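The "store metrics in Redis" step above could look something like this sketch. Everything here is an assumption for illustration: the class name, key naming scheme, and TTL are invented, and the real implementation would live behind whatever interface Geo::MetricsUpdateWorker already uses.

```ruby
require 'json'

# Hypothetical sketch: persist each collected metric in Redis so the worker
# can export the last known value even when a fresh query fails or times out.
class MetricsCache
  # Assumed TTL: longer than the collection interval, so a few failed
  # collection cycles don't blank out the dashboards.
  TTL_SECONDS = 10 * 60

  def initialize(redis)
    @redis = redis
  end

  def write(name, value)
    payload = { value: value, recorded_at: Time.now.to_i }.to_json
    @redis.set("geo:metrics:#{name}", payload, ex: TTL_SECONDS)
  end

  # Returns { value:, recorded_at: } or nil if never collected / expired.
  def read(name)
    raw = @redis.get("geo:metrics:#{name}")
    raw && JSON.parse(raw, symbolize_names: true)
  end
end
```

Storing a recorded_at timestamp alongside each value would let the exporter (or an alert) distinguish a fresh count from a stale last-known value.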
There may be similarities with Migrate Service Ping metrics to instrumentation classes. If so, this may be straightforward enough that after the first few metrics, we can apply the Seeking community contributions label to the rest.
I started playing with Service Ping to see what adding a metric looks like; see 2a9d4ee2:
diff --git a/config/metrics/counts_all/20240709022641_count_job_artifacts_verification_succeeded.yml b/config/metrics/counts_all/20240709022641_count_job_artifacts_verification_succeeded.yml
new file mode 100644
index 000000000000..9a016537e101
--- /dev/null
+++ b/config/metrics/counts_all/20240709022641_count_job_artifacts_verification_succeeded.yml
@@ -0,0 +1,19 @@
+---
+key_path: counts.job_artifacts_verification_succeeded
+description: Count of job artifacts with verification succeeded
+product_group: geo
+value_type: number
+status: active
+milestone: "17.2"
+introduced_by_url:
+time_frame: all
+data_source: database
+data_category: optional
+performance_indicator_type: []
+distribution:
+- ce
+- ee
+tier:
+- free
+- premium
+- ultimate
diff --git a/lib/gitlab/usage/metrics/instrumentations/count_job_artifacts_verification_succeeded_metric.rb b/lib/gitlab/usage/metrics/instrumentations/count_job_artifacts_verification_succeeded_metric.rb
new file mode 100644
index 000000000000..3ec480658df3
--- /dev/null
+++ b/lib/gitlab/usage/metrics/instrumentations/count_job_artifacts_verification_succeeded_metric.rb
@@ -0,0 +1,17 @@
+# frozen_string_literal: true
+
+module Gitlab
+ module Usage
+ module Metrics
+ module Instrumentations
+ class CountJobArtifactsVerificationSucceededMetric < DatabaseMetric
+ operation :count
+
+ relation do
+ ::Ci::JobArtifact.verification_succeeded
+ end
+ end
+ end
+ end
+ end
+end
diff --git a/spec/lib/gitlab/usage/metrics/instrumentations/count_job_artifacts_verification_succeeded_metric_spec.rb b/spec/lib/gitlab/usage/metrics/instrumentations/count_job_artifacts_verification_succeeded_metric_spec.rb
new file mode 100644
index 000000000000..feb2ae0b9b1c
--- /dev/null
+++ b/spec/lib/gitlab/usage/metrics/instrumentations/count_job_artifacts_verification_succeeded_metric_spec.rb
@@ -0,0 +1,20 @@
+# frozen_string_literal: true
+
+require 'spec_helper'
+
+RSpec.describe Gitlab::Usage::Metrics::Instrumentations::CountJobArtifactsVerificationSucceededMetric, feature_category: :service_ping do
+ let_it_be(:excluded) { create(:ee_ci_job_artifact, :verification_failed) }
+
+ context 'with a verification succeeded job artifact' do
+ let(:expected_value) { 1 }
+ let_it_be(:verification_succeeded_job_artifact) { create(:ee_ci_job_artifact, :verification_succeeded) }
+
+ it_behaves_like 'a correct instrumented metric value', { time_frame: 'all', data_source: 'database' }
+ end
+
+ context 'with no verification succeeded job artifact' do
+ let(:expected_value) { 0 }
+
+ it_behaves_like 'a correct instrumented metric value', { time_frame: 'all', data_source: 'database' }
+ end
+end
Weight 8 to migrate a few metrics to the new architecture, then many weight 1 issues to migrate the rest of the metrics.