Fix the bug that DB multi-request stickiness doesn't stick as expected
What does this MR do?
Fix the bug mentioned in gitlab-com/gl-infra/scalability#1006 (closed). In summary, Our DB Load Balancing layer has a special sticking mechanism spreading across requests. This mechanism ensures the read consistency caused by replicas' replication lag. This mechanism forces the next requests stick to the primary by storing the last write location of postgres into a redis key determined by a pair [namespace, id]
. This is how to register a sticky object:
def current_job
id = params[:id]
if id
::Gitlab::Database::LoadBalancing::RackMiddleware
.stick_or_unstick(env, :build, id)
end
super
end
Our current implementation is storing just one latest namespace and id:
def self.stick_or_unstick(env, namespace, id)
return unless LoadBalancing.enable?
Sticking.unstick_or_continue_sticking(namespace, id)
env[STICK_OBJECT] = [namespace, id]
end
That makes multi-request stickiness doesn't stick as expected if an endpoint touches multiple objects. This MR is to store all namespaces and ids ever recorded into a set. The next requests iterate and check for all the write locations instead of the latest one. In detail:
- After a request is done, the write locations are written into multiple keys determined by recorded namespaces and ids. As the write location is fetched from Postgres at this stage, all of the locations are all equal or greater than the write location when the actual writes are performed, hence if the following requests depend on any recorded locations, the consistency is guaranteed.
- Before a request, the middleware checks if any of the replica catches up with the recorded write location from the watching stick objects. Therefore some cases to consider:
- If the request doesn't watch any recorded stick object, it means the request is irrelevant to the previous ones, we can use any replica.
- If the request watches just a subset of the recorded stick objects, the loop in unstick_or_continue_sticking forces the session to stick to the primary if any of them doesn't catch up. It may be defensive, but safe for the consistency.
In conclusion, the read consistency is strengthen after this change.
Does this MR meet the acceptance criteria?
Conformity
-
📋 Does this MR need a changelog?-
I have included a changelog entry. -
I have not included a changelog entry because _____.
-
-
Documentation (if required) -
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. -
Tested in all supported browsers -
Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
-
Label as security and @ mention @gitlab-com/gl-security/appsec
-
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods -
Security reports checked/validated by a reviewer from the AppSec team