Atomically select replicas that meet LSN requirement [RUN ALL RSPEC] [RUN AS-IF-FOSS]
During a merge, we attempt to find a matching merge request with a SHA using a replica that should be up-to-date with the primary for a given PostgreSQL log sequence number (LSN) (https://gitlab.com/gitlab-org/gitlab/blob/e20eacf711fbb6bf503361025406e24bcae47329/ee/lib/ee/gitlab/checks/matching_merge_request.rb#L30). However, there is a race condition that can happen if service discovery alters the host list after this check has taken place. This most likely happens when a Web worker starts:
-
When Rails starts up for the first time, there is a 1-minute or 2-minute delay before service discovery finds replicas (see #271575 (closed)).
-
During this time
LoadBalancer#all_caught_up?
will returntrue
. This will indicate to the Web worker that it can use replicas and does not have to use the primary. -
During a request, service discovery may load all the replicas and change the host list. As a result, the next read may be directed to a lagging replica.
However, this may cause a merge to fail if it cannot find a match. We can see this in #247857 (comment 531936962) where replication lag increased and number of available load balancing hosts jumped from 0 to 18 around the same time.
When a user merges a merge request, Sidekiq logs the minimum LSN needed to match a merge request for the API. If we have this LSN, we now:
- Select from the available list of replicas that meet this LSN requirement.
- Store this subset for the given request.
- Round-robin reads with this subset of replicas.
Relates to #247857 (closed)
Testing (what I did)
- Set up 3 PostgreSQL replicas (see gitlab-development-kit!1901 (diffs) for updated instructions).
- In 1 of the replicas, set
recovery_min_apply_delay = '20s'
indata/gitlab.conf
. This will make 1 replica lag the others. - Validate the GDK is using all 3 replicas via
log_statement = 'all'
in the PostgreSQL configuration. Check thereplica
keyword in the performance bar as well. - Enable the feature flag:
Feature.enable(:load_balancing_atomic_replica)
. - In a test repo, add
master
as a protected branch thatNo one
can push. - Apply this diff for logging:
diff --git a/ee/lib/gitlab/database/load_balancing/load_balancer.rb b/ee/lib/gitlab/database/load_balancing/load_balancer.rb
index 50486842a37..140cf6d7684 100644
--- a/ee/lib/gitlab/database/load_balancing/load_balancer.rb
+++ b/ee/lib/gitlab/database/load_balancing/load_balancer.rb
@@ -157,6 +157,7 @@ def select_caught_up_hosts(location)
all_hosts = @host_list.hosts
valid_hosts = all_hosts.select { |host| host.caught_up?(location) }
+ Rails.logger.info "=== valid hosts: #{valid_hosts.map(&:host)}"
return false if valid_hosts.empty?
# Hosts can come online after the time when this scan was done,
@@ -168,6 +169,8 @@ def select_caught_up_hosts(location)
# pick a random host and mix up the original list to ensure we don't
# only end up using one replica.
RequestStore[CACHE_KEY] = valid_hosts.sample
+
+ Rails.logger.info "=== picked host #{RequestStore[CACHE_KEY].host}"
@host_list.shuffle
true
diff --git a/ee/lib/gitlab/database/load_balancing/sticking.rb b/ee/lib/gitlab/database/load_balancing/sticking.rb
index efbd7099300..6a988d2e146 100644
--- a/ee/lib/gitlab/database/load_balancing/sticking.rb
+++ b/ee/lib/gitlab/database/load_balancing/sticking.rb
@@ -46,6 +46,8 @@ def self.all_caught_up?(namespace, id)
def self.select_caught_up_replicas(namespace, id)
location = last_write_location_for(namespace, id)
+ Rails.logger.info "=== last write location #{location}"
+
# Unlike all_caught_up?, we return false if no write location exists.
# We want to be sure we talk to a replica that has caught up for a specific
# write location. If no such location exists, err on the side of caution.
Now that this is set up:
- Create a merge request.
- Click
Merge
.
In development.log
, I saw that it properly selected the second and third replica, and most important the merge was successful.
=== last write location 14/303C2B40
=== valid hosts: ["/Users/stanhu/gdk-ee/postgresql-replica/data3", "/Users/stanhu/gdk-ee/postgresql-replica/data2"]
=== picked host /Users/stanhu/gdk-ee/postgresql-replica/data2