Skip to content

Atomically select replicas that meet LSN requirement [RUN ALL RSPEC] [RUN AS-IF-FOSS]

Stan Hu requested to merge sh-load-balancer-atomic-replica into master

During a merge, we attempt to find a matching merge request with a SHA using a replica that should be up-to-date with the primary for a given PostgreSQL log sequence number (LSN) (https://gitlab.com/gitlab-org/gitlab/blob/e20eacf711fbb6bf503361025406e24bcae47329/ee/lib/ee/gitlab/checks/matching_merge_request.rb#L30). However, there is a race condition that can happen if service discovery alters the host list after this check has taken place. This most likely happens when a Web worker starts:

  1. When Rails starts up for the first time, there is a 1-minute or 2-minute delay before service discovery finds replicas (see #271575 (closed)).

  2. During this time LoadBalancer#all_caught_up? will return true. This will indicate to the Web worker that it can use replicas and does not have to use the primary.

  3. During a request, service discovery may load all the replicas and change the host list. As a result, the next read may be directed to a lagging replica.

However, this may cause a merge to fail if it cannot find a match. We can see this in #247857 (comment 531936962) where replication lag increased and number of available load balancing hosts jumped from 0 to 18 around the same time.

When a user merges a merge request, Sidekiq logs the minimum LSN needed to match a merge request for the API. If we have this LSN, we now:

  1. Select from the available list of replicas that meet this LSN requirement.
  2. Store this subset for the given request.
  3. Round-robin reads with this subset of replicas.

Relates to #247857 (closed)

Testing (what I did)

  1. Set up 3 PostgreSQL replicas (see gitlab-development-kit!1901 (diffs) for updated instructions).
  2. In 1 of the replicas, set recovery_min_apply_delay = '20s' in data/gitlab.conf. This will make 1 replica lag the others.
  3. Validate the GDK is using all 3 replicas via log_statement = 'all' in the PostgreSQL configuration. Check the replica keyword in the performance bar as well.
  4. Enable the feature flag: Feature.enable(:load_balancing_atomic_replica).
  5. In a test repo, add master as a protected branch that No one can push.
  6. Apply this diff for logging:
diff --git a/ee/lib/gitlab/database/load_balancing/load_balancer.rb b/ee/lib/gitlab/database/load_balancing/load_balancer.rb
index 50486842a37..140cf6d7684 100644
--- a/ee/lib/gitlab/database/load_balancing/load_balancer.rb
+++ b/ee/lib/gitlab/database/load_balancing/load_balancer.rb
@@ -157,6 +157,7 @@ def select_caught_up_hosts(location)
           all_hosts = @host_list.hosts
           valid_hosts = all_hosts.select { |host| host.caught_up?(location) }
 
+          Rails.logger.info "=== valid hosts: #{valid_hosts.map(&:host)}"
           return false if valid_hosts.empty?
 
           # Hosts can come online after the time when this scan was done,
@@ -168,6 +169,8 @@ def select_caught_up_hosts(location)
           # pick a random host and mix up the original list to ensure we don't
           # only end up using one replica.
           RequestStore[CACHE_KEY] = valid_hosts.sample
+
+          Rails.logger.info "=== picked host #{RequestStore[CACHE_KEY].host}"
           @host_list.shuffle
 
           true
diff --git a/ee/lib/gitlab/database/load_balancing/sticking.rb b/ee/lib/gitlab/database/load_balancing/sticking.rb
index efbd7099300..6a988d2e146 100644
--- a/ee/lib/gitlab/database/load_balancing/sticking.rb
+++ b/ee/lib/gitlab/database/load_balancing/sticking.rb
@@ -46,6 +46,8 @@ def self.all_caught_up?(namespace, id)
         def self.select_caught_up_replicas(namespace, id)
           location = last_write_location_for(namespace, id)
 
+          Rails.logger.info "=== last write location #{location}"
+
           # Unlike all_caught_up?, we return false if no write location exists.
           # We want to be sure we talk to a replica that has caught up for a specific
           # write location. If no such location exists, err on the side of caution.

Now that this is set up:

  1. Create a merge request.
  2. Click Merge.

In development.log, I saw that it properly selected the second and third replica, and most important the merge was successful.

=== last write location 14/303C2B40
=== valid hosts: ["/Users/stanhu/gdk-ee/postgresql-replica/data3", "/Users/stanhu/gdk-ee/postgresql-replica/data2"]
=== picked host /Users/stanhu/gdk-ee/postgresql-replica/data2
Edited by Stan Hu

Merge request reports

Loading