Fix `all_caught_up?` unsticking too early
What does this MR do and why?
Background Reading
`::Gitlab::Database::LoadBalancing` is a part of our database architecture that ensures we can balance our queries across our primary and replica databases. One of the key components is deciding when a query must go to the primary and when it can be sent to the replicas. The replicas have some delay, so it's important we only use them in workloads where a small amount of replication lag won't cause problems. Replication lag is usually within a few seconds at most, so for many workloads it's OK to be reading from a replica.
The `::Gitlab::Database::LoadBalancing::Sticking` class is responsible for ensuring that when a user performs a write query (e.g. `INSERT INTO issues`) we stick to the primary database for some period of time, so that subsequent read queries (e.g. `SELECT * FROM issues`) always go to the primary. We call this "sticking". If we didn't do this, there would be cases where a user (for example) creates an issue, gets redirected to that issue, and gets a 404. We are OK with some other user getting a 404 if they somehow manage to get the URL for an issue that another user created within the last second, but it's not OK for a user to not see the data that they just wrote, as this would appear to be a bug.
Sticking in GitLab tries to be precise so that we can start sending queries back to replicas as soon as it's safe to do so. We don't arbitrarily stick to the primary for some long period of time, because that would not utilize our replicas efficiently.

The way we achieve this precision is by keeping track of the LSN (Log Sequence Number), an internal Postgres identifier that represents a position in the write-ahead log. LSN positions are comparable, so if a replica has replicated position "4" we can safely assume it has also replicated position "3". We take advantage of this by storing in Redis the LSN that corresponds to the write the user just performed. The next time the same user performs a read query, we check whether any replica has replicated at or past that last-written LSN. If we find that all the replicas are caught up we can also safely "unstick" the user, which is done by deleting the LSN position from Redis, because there is no need to check over and over for a caught up replica when every replica is already ahead of that position. (Side note: in writing this MR description I realised there is yet another, more complicated race condition in this logic, which I've extracted into #421868; that problem already exists so I won't address it in this MR.)
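To make the mechanics concrete, here is a minimal sketch of the idea in plain Ruby using the `pg` and `redis` gems directly. The connections, Redis key name, and helper names are illustrative assumptions for this sketch only; the real implementation lives in the `Sticking` and `LoadBalancer` classes and goes through the load balancer's own connection handling:

```ruby
require "pg"
require "redis"

# Hypothetical connections for the sketch; the real code does not open raw
# PG connections like this.
PRIMARY  = PG.connect(host: "primary.db", dbname: "gitlabhq_production")
REPLICAS = [
  PG.connect(host: "replica-1.db", dbname: "gitlabhq_production"),
  PG.connect(host: "replica-2.db", dbname: "gitlabhq_production")
]
REDIS = Redis.new

# After a user performs a write, remember the primary's current LSN for them.
def stick(user_id)
  lsn = PRIMARY.exec("SELECT pg_current_wal_insert_lsn()").getvalue(0, 0)
  REDIS.set("user:#{user_id}:written-location", lsn) # hypothetical key name
end

# Before a read, find which replicas have replicated past the stored LSN.
def caught_up_replicas(user_id)
  lsn = REDIS.get("user:#{user_id}:written-location")
  return REPLICAS unless lsn # nothing stored: any replica is fine

  REPLICAS.select do |replica|
    # pg_wal_lsn_diff(replayed, stored) >= 0 means this replica is caught up.
    diff = replica.exec_params(
      "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), $1)", [lsn]
    ).getvalue(0, 0)
    diff.to_i >= 0
  end
end
```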
This "unsticking" logic previously had a bug in which we were "unsticking" the user as soon as we found some caught up replica. This was an efficient optimization in the case where all the replicas are pretty much as caught up as each other. But if there is some case where 1 replica is lagging behind a bit more than the other replicas this could lead to us choosing this replica for subsequent queries and this user might start to see bugs where data just disappears after loading another page or even in the same page view because AJAX requests are loading lots of different elements of data across the same page load.
The sticking logic described above is actually a little more sophisticated and has the following capabilities:
- Stickiness doesn't need to be "per-user"; it can also be "per-project", "per-MR", "per-build" or "per-runner". Some async workloads are not user-based, and stale data would still cause application problems if the application sees out of date data. For example, merging a merge request happens across a few steps, and at each step we need to ensure the data is up to date so that there are no edge cases.
- Stickiness in Sidekiq is handled differently because it is not user-based. Instead of storing the LSN in Redis, we store it in the Sidekiq job metadata: when a job is queued we record the primary's current LSN in the metadata, and when the job is picked up we check all replicas to see whether any of them has caught up with that position. If one has, it is used for read queries in the context of that job (see the sketch after this list).
- In the context of a web request we always remember the replica we choose. This is stored in `SafeRequestStore`, which is cleared between web requests. This is important to avoid weird edge cases where sequential reads see different states of history.
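For the Sidekiq case in particular, the data flow looks roughly like this. This is again only a sketch: the payload key is an assumption, and it reuses the `PRIMARY`/`REPLICAS` connections from the earlier sketch rather than the real client/server middleware:

```ruby
job = { "class" => "SomeWorker", "args" => [123] }

# At enqueue time: capture the primary's current LSN inside the job payload,
# so it travels with the job instead of living in Redis per user.
job["primary_write_location"] =
  PRIMARY.exec("SELECT pg_current_wal_insert_lsn()").getvalue(0, 0)

# At perform time: prefer a replica that has already replayed past that LSN.
caught_up = REPLICAS.find do |replica|
  replica.exec_params(
    "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), $1) >= 0",
    [job["primary_write_location"]]
  ).getvalue(0, 0) == "t"
end

reads_go_to = caught_up || PRIMARY # no caught-up replica yet: use the primary
```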
The actual change in this MR
The main change in this MR is to fix the bug where we cleared the LSN position from Redis once we found a single caught up replica. You will see this in the changes to the `Sticking` and `LoadBalancer` classes. These classes now need to distinguish between:

- `:none`: no replicas are caught up => we must use the primary for queries
- `:any`: some replica is caught up => it is stored in `SafeRequestStore` and can be used for read queries
- `:all`: all replicas are caught up => one is stored in `SafeRequestStore` and can be used for read queries, and additionally the `Sticking` class can now safely clear the LSN position from Redis and not check again in future
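A minimal sketch of what each outcome means for the caller (the symbols mirror the list above, but the method and return shape are illustrative, not the real class interfaces):

```ruby
# Map the three outcomes to the routing / unsticking decision described above.
def decide(caught_up_result)
  case caught_up_result
  when :none
    { read_from: :primary, clear_stored_lsn: false } # keep the LSN, check again later
  when :any
    { read_from: :replica, clear_stored_lsn: false } # a slower replica may still lag
  when :all
    { read_from: :replica, clear_stored_lsn: true }  # safe to stop checking
  end
end

decide(:any) # => { read_from: :replica, clear_stored_lsn: false }
```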
In making this change it became clear there are too many entrypoints into effectively the same logic, each used slightly differently. All code paths that used `Sticking` were audited and a general theme emerged: essentially all callers only needed to do one of two things:

- Find a caught up replica => `#find_caught_up_replica`
- Remember the LSN for a recent write => `#stick`

So we reduced the public API down to just these two methods (three if you include `bulk_stick`). A few callers want slightly different behaviour from `find_caught_up_replica`, so there are now a few optional arguments to this method. Ideally we'd narrow this down further and remove some of these arguments, but it seemed safer (without extensive testing) to keep the existing behaviour of the deprecated methods.
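For illustration, the intended call pattern after this MR looks roughly like the following. The `(scope, id)` argument shape is an assumption based on the "per-user"/"per-project" scopes described earlier, and `sticking` stands in for however the caller obtains the `Sticking` instance; the exact signatures and optional arguments are in the diff:

```ruby
# After a write: remember the primary's current LSN for this user.
sticking.stick(:user, current_user.id)

# Before a read on a later request: prefer a caught-up replica. If all
# replicas are caught up the stored LSN is cleared; if none are, queries
# stay on the primary and we check again next time.
sticking.find_caught_up_replica(:user, current_user.id)

# Writes affecting many records of the same scope can stick them in bulk.
sticking.bulk_stick(:project, project_ids)
```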
Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
| Before | After |
| ------ | ----- |
|        |       |
How to set up and validate locally
- Configure load balancing with service discovery and multiple replicas, as described in https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing_with_service_discovery.md
- Update `<GDK_ROOT>/postgresql-replica-2/data/postgresql.conf` with `recovery_min_apply_delay = 5s` (a small script to confirm the delay is in effect follows this list)
- Create an issue in any project
- Without the changes in this MR you'll often see some AJAX requests fail with 404 errors; with this change you should not see those 404s
- Test other features still work as expected
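Optionally, to confirm the delayed replica really is lagging while you test, something like the following can be run. This is only a sketch: the host, port, and database name are assumptions about a local GDK setup and will need adjusting:

```ruby
require "pg"

# Connect directly to the delayed replica and report its apply lag; under
# write load this should hover around the configured 5 seconds.
replica = PG.connect(host: "localhost", port: 5434, dbname: "gitlabhq_development")
lag = replica.exec("SELECT now() - pg_last_xact_replay_timestamp() AS lag").getvalue(0, 0)
puts "replica apply lag: #{lag}"
```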
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #413203 (closed)