Allow using db replicas for GraphQL subs
What does this MR do and why?
We use GraphQL subscriptions to push changes to subscribers over a websocket. Since these events propagate from any GitLab node over Redis PubSub, they are susceptible to "read your own writes" problems caused by replication lag: a subscriber may read from a replica before the write that triggered the event has propagated to it. To avoid this, we have so far forced these reads to go to the Postgres primary. However, this is unsustainable if we want to grow real-time adoption at GitLab.
We already have a solution for addressing replication lag issues with Sidekiq: capture the WAL (write-ahead log) location just before scheduling the job, pass it through Redis to the worker, then find a replica that is caught up to this location. Only if no such replica can be found do we fall back to the primary.
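The replica selection step described above can be sketched as follows. This is a simplified model and not GitLab's load balancer code: `Host`, `lsn_to_i`, and `select_host` are hypothetical names, and the only assumption taken from PostgreSQL is that a WAL location (LSN) is printed as two hexadecimal numbers separated by a slash.

```ruby
# Hypothetical, simplified model of "find a caught-up replica, else primary".

# Convert a PostgreSQL LSN such as "16/B374D848" into a comparable integer.
# The part before the slash holds the high 32 bits, the part after the low 32.
def lsn_to_i(lsn)
  high, low = lsn.split('/').map { |part| part.to_i(16) }
  (high << 32) | low
end

# A database host and the WAL location it has replayed so far.
Host = Struct.new(:name, :replayed_lsn)

# Return the first replica that has replayed at least up to `write_lsn`,
# or the primary if no replica has caught up yet.
def select_host(replicas, primary, write_lsn)
  target = lsn_to_i(write_lsn)
  replicas.find { |replica| lsn_to_i(replica.replayed_lsn) >= target } || primary
end

primary  = Host.new('primary', '0/3000060')
replicas = [
  Host.new('replica-1', '0/3000000'), # still lagging behind the write
  Host.new('replica-2', '0/3000060')  # has replayed the write
]

# The location captured just before the job was scheduled.
write_lsn = '0/3000060'
puts select_host(replicas, primary, write_lsn).name
```

With the data above, `replica-1` is skipped because it has not yet replayed the write, and `replica-2` is selected. If both replicas were lagging, the call would return the primary.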
I adopted the same approach here for GraphQL subs, but it was harder to do because we do not have direct control over the event payload that propagates through Redis PubSub. What this MR does:
- Extracts WAL capturing and reading logic into two new concerns: WalTrackingSender and WalTrackingReceiver.
  - WalTrackingSender provides an interface to produce a hash holding the current WAL location reference (a string) for all DBs (main, ci).
  - WalTrackingReceiver provides an interface to take that hash and select an up-to-date replica if available.
- Extends ActionCableWithLoadBalancing with that concern, and now only falls back to the primary if the LB could not find a caught up replica.
- Refactors the Sidekiq middleware to also use these extracted concerns.
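A rough sketch of how the two concerns could fit together. Only the module names WalTrackingSender and WalTrackingReceiver come from this MR; the method names (`wal_locations`, `select_hosts`) and the `FakeLoadBalancer` class are hypothetical stand-ins, so treat this as an illustration of the interface shape rather than the actual implementation.

```ruby
# Illustrative only: FakeLoadBalancer and all method names are assumptions.
class FakeLoadBalancer
  attr_reader :name

  def initialize(name, primary_lsn:, replica_lsns:)
    @name = name
    @primary_lsn = primary_lsn
    @replica_lsns = replica_lsns
  end

  # Current WAL location of the primary, as a string.
  def primary_write_location
    @primary_lsn
  end

  # True if at least one replica has replayed up to `location`.
  def replica_caught_up?(location)
    @replica_lsns.any? { |lsn| to_i(lsn) >= to_i(location) }
  end

  private

  # "high/low" hexadecimal LSN string to integer.
  def to_i(lsn)
    high, low = lsn.split('/').map { |part| part.to_i(16) }
    (high << 32) | low
  end
end

module WalTrackingSender
  # Build a hash of DB name => current WAL location, one entry per DB.
  def wal_locations(load_balancers)
    load_balancers.to_h { |lb| [lb.name, lb.primary_write_location] }
  end
end

module WalTrackingReceiver
  # For each DB, pick :replica if one has caught up, otherwise :primary.
  # A missing location is treated as "unknown" and also uses the primary.
  def select_hosts(load_balancers, wal_locations)
    load_balancers.to_h do |lb|
      location = wal_locations[lb.name]
      caught_up = location && lb.replica_caught_up?(location)
      [lb.name, caught_up ? :replica : :primary]
    end
  end
end

load_balancers = [
  FakeLoadBalancer.new(:main, primary_lsn: '0/3000060', replica_lsns: ['0/3000000']),
  FakeLoadBalancer.new(:ci, primary_lsn: '0/2000000', replica_lsns: ['0/2000000'])
]

sender = Object.new.extend(WalTrackingSender)
receiver = Object.new.extend(WalTrackingReceiver)

locations = sender.wal_locations(load_balancers)
p receiver.select_hosts(load_balancers, locations)
```

In this example the main replica is lagging, so main uses the primary, while ci can use its replica. Deciding per database keeps lag on one database from forcing the other onto its primary.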
The implementation for ActionCableWithLoadBalancing is not terribly clean, because graphql-ruby does not provide us with the necessary hooks or interfaces to easily inject custom payloads into events. In order to do that, I had to write a custom Serializer that decorates the upstream serializer to wrap all events in an envelope that embeds WAL location data and the original payload.
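The envelope idea can be illustrated with a small decorator. This is a sketch under assumptions, not the Serializer from this MR: `UpstreamSerializer` is a JSON-based stand-in for the graphql-ruby serializer (assumed here to expose `dump` and `load`), and `EnvelopeSerializer` and the envelope keys are hypothetical names.

```ruby
require 'json'

# Stand-in for the upstream serializer; assumed to respond to dump/load.
module UpstreamSerializer
  def self.dump(obj)
    JSON.generate(obj)
  end

  def self.load(str)
    JSON.parse(str)
  end
end

# Decorates an upstream serializer: every dumped event is wrapped in an
# envelope that carries the WAL locations next to the original payload.
class EnvelopeSerializer
  # `wal_locations_provider` is any callable returning a hash of
  # DB name => WAL location string at the time the event is dumped.
  def initialize(upstream, wal_locations_provider)
    @upstream = upstream
    @wal_locations_provider = wal_locations_provider
  end

  def dump(payload)
    JSON.generate(
      'wal_locations' => @wal_locations_provider.call,
      'payload' => @upstream.dump(payload)
    )
  end

  # Returns [wal_locations, original_payload].
  def load(message)
    envelope = JSON.parse(message)
    [envelope['wal_locations'], @upstream.load(envelope['payload'])]
  end
end

serializer = EnvelopeSerializer.new(
  UpstreamSerializer,
  -> { { 'main' => '0/3000060', 'ci' => '0/2000000' } }
)

message = serializer.dump({ 'issue_id' => 3 })
wal_locations, payload = serializer.load(message)
p wal_locations
p payload
```

The receiving side gets the WAL locations back together with the untouched original payload, so it can pick a replica before handing the payload to the code that runs the subscription query.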
An alternative would have been to do this at the Action Cable level, which would be the more generic solution. It turned out to be even harder, though: GraphQL subs pass events around as strings, not hashes (so we'd have to do some extra parsing), and Action Cable does not allow us to use a custom coder for broadcasts, only for receivers.
The main change is behind a FF: #408178 (closed)
Screenshots or screen recordings
No replication lag:
EpicIssue Load (0.3ms) SELECT "epic_issues".* FROM "epic_issues" WHERE "epic_issues"."issue_id" = 3 LIMIT 1 /*application:web,correlation_id:1be5a0bc8ce441c8f7d6238bae262dd4,endpoint_id:graphql:issuableEpicUpdated,db_config_name:main_replica,line:/ee/app/models/ee/issue.rb:314:in `has_epic?'*/
Note how it uses a replica (db_config_name:main_replica).
With replication lag:
EpicIssue Load (0.8ms) SELECT "epic_issues".* FROM "epic_issues" WHERE "epic_issues"."issue_id" = 3 LIMIT 1 /*application:web,correlation_id:72d7d468228dec10640d10fbf6460f12,endpoint_id:graphql:issuableEpicUpdated,db_config_name:main,line:/ee/app/models/ee/issue.rb:314:in `has_epic?'*/
Note how it falls back to the primary (db_config_name:main).
How to set up and validate locally
To test this, simulate replication lag:
- GDK: https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing.md#simulating-replication-delay
- GCK: https://gitlab.com/gitlab-org/gitlab-compose-kit/-/blob/master/README.md#postgresql-with-streamingphysical-replication
Toggle FF graphql_subs_lb on/off:
- off: should use primary always
- on: should use replicas if there is no lag, otherwise primary
You can then see in the PG marginalia for GraphQL subscription queries that they now go to a replica. To test this, you can use any feature that causes a GraphQL subscription to fire. Examples are:
- labels on the issue page sidebar
- epic link on the issue page sidebar
- assignees on the issue page sidebar
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #402999 (closed)