Add last_seat_refresh_at to gitlab subscriptions
## What does this MR do and why?
Adds a new column, `last_seat_refresh_at`, to the `gitlab_subscriptions` table.

TL;DR: we need it to track the status of updated rows when using a limited capacity worker to refresh seat attributes. For more info/context, see below.
```shell
$ bin/rails db:migrate
main: == 20221114145103 AddLastSeatRefreshAtToGitlabSubscriptions: migrating ========
main: -- add_column(:gitlab_subscriptions, :last_seat_refresh_at, :datetime_with_timezone)
main:    -> 0.0013s
main: == 20221114145103 AddLastSeatRefreshAtToGitlabSubscriptions: migrated (0.0015s)
ci: == 20221114145103 AddLastSeatRefreshAtToGitlabSubscriptions: migrating ========
ci: -- add_column(:gitlab_subscriptions, :last_seat_refresh_at, :datetime_with_timezone)
ci:    -> 0.0007s
ci: == 20221114145103 AddLastSeatRefreshAtToGitlabSubscriptions: migrated (0.0008s)
```
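Based on the migration output above, the migration file is presumably a one-line `add_column`; a sketch of what it likely looks like (the base migration class and version are assumptions, not confirmed by this MR):

```ruby
# db/migrate/20221114145103_add_last_seat_refresh_at_to_gitlab_subscriptions.rb
# Sketch reconstructed from the migration output above; illustrative only.
class AddLastSeatRefreshAtToGitlabSubscriptions < Gitlab::Database::Migration[2.0]
  def change
    add_column :gitlab_subscriptions, :last_seat_refresh_at, :datetime_with_timezone
  end
end
```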
Refs #334903 (closed)
### Background
Every subscription for GitLab.com is represented by a `GitlabSubscription`. The `GitlabSubscription` contains 3 key pieces of information:
- `max_seats_used` - the maximum number of billable seats the `Namespace` has used
- `seats_in_use` - the current number of billable seats the `Namespace` is using
- `seats_owed` - the number of seats the customer needs to pay for
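As a rough illustration of how these attributes relate (assuming `seats_owed` is the number of seats used beyond what the subscription purchased; the helper below is hypothetical, not code from this MR):

```ruby
# Hypothetical helper: seats owed is how far max_seats_used exceeds the
# number of seats purchased on the subscription (never negative).
def seats_owed(max_seats_used:, purchased_seats:)
  [max_seats_used - purchased_seats, 0].max
end

seats_owed(max_seats_used: 12, purchased_seats: 10) # => 2
seats_owed(max_seats_used: 8,  purchased_seats: 10) # => 0
```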
To keep these attributes up to date, a worker runs every day at midnight UTC that:
- iterates over every single `GitlabSubscription`
- refreshes the seat attributes for each one
- updates the DB records via a manual SQL `UPDATE` to be more performant (one `UPDATE` query for each batch of subscriptions)
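To cover a whole batch with one statement, the SQL presumably uses something like PostgreSQL's `UPDATE ... FROM (VALUES ...)` form. A hypothetical sketch of building such a statement (not the actual query from the worker):

```ruby
# Build one UPDATE statement for a batch of subscriptions using
# PostgreSQL's UPDATE ... FROM (VALUES ...) construct. Illustrative only.
def batch_update_sql(rows)
  values = rows
    .map { |r| "(#{r[:id]}, #{r[:max_seats_used]}, #{r[:seats_in_use]}, #{r[:seats_owed]})" }
    .join(', ')

  <<~SQL
    UPDATE gitlab_subscriptions AS s
    SET max_seats_used = v.max_seats_used,
        seats_in_use   = v.seats_in_use,
        seats_owed     = v.seats_owed
    FROM (VALUES #{values}) AS v(id, max_seats_used, seats_in_use, seats_owed)
    WHERE s.id = v.id
  SQL
end

puts batch_update_sql([
  { id: 1, max_seats_used: 12, seats_in_use: 10, seats_owed: 2 },
  { id: 2, max_seats_used: 5,  seats_in_use: 5,  seats_owed: 0 }
])
```

Because the rows are written directly with raw SQL, none of the ActiveRecord callbacks on the model fire, which is the third problem listed below.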
### The Problem
- The worker that runs each day has historically been prone to error (gitlab-com/gl-infra/scalability#1116 (closed)) due to timeouts
- The existing job is very long running, so it is at risk of being interrupted (e.g. by a pod or process restart), leaving namespaces without updated seat attributes. Its run time will also only ever increase as the number of subscriptions on GitLab.com grows
- The manual SQL means we bypass any callbacks defined in the model
### The Solution
The solution is to replace the single job with Limited Capacity jobs (see the Sidekiq limited capacity worker documentation).
Doing so will allow us to have:
- One quick-running job per `GitlabSubscription`
- A loop over all `GitlabSubscription`s without fear of interruption
- "Normal" update methods that don't bypass the regular lifecycle hooks/callbacks
Recalculating the seat attributes is important for billing and usage statistics, so the plan is to add the new limited capacity worker behind a feature flag (rollout issue) so that we can have both running at the same time initially.
Once we have confirmed the new job is working as expected, we can remove the old job and the feature flag.
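In GitLab, a limited capacity worker includes the `LimitedCapacity::Worker` concern and implements `perform_work`, `remaining_work_count`, and `max_running_jobs`. A rough sketch of what the new worker could look like (the class name, capacity value, and `refresh_seat_attributes!` call are assumptions, not the actual code from the follow-up MRs):

```ruby
# Illustrative sketch only; names and values are assumptions.
class GitlabSubscriptions::RefreshSeatsWorker
  include ApplicationWorker
  include LimitedCapacity::Worker

  MAX_RUNNING_JOBS = 20 # assumed capacity

  def perform_work
    subscription = next_stale_subscription
    return unless subscription

    # Stamp first so parallel jobs skip this row, then refresh.
    subscription.update!(last_seat_refresh_at: Time.current)
    subscription.refresh_seat_attributes! # hypothetical refresh method
  end

  def remaining_work_count(*)
    next_stale_subscription ? 1 : 0
  end

  def max_running_jobs
    MAX_RUNNING_JOBS
  end

  private

  def next_stale_subscription
    GitlabSubscription
      .where('last_seat_refresh_at IS NULL OR last_seat_refresh_at < ?', 24.hours.ago)
      .first
  end
end
```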
### How will it work?
The limited capacity setup will essentially do the following:
- A cron job will schedule the seat attribute refresh every 6 hours
- The refresh worker will:
  - Look for the next `GitlabSubscription` that has not been refreshed in the last 24 hours
  - Immediately update the last refreshed timestamp (`last_seat_refresh_at`) so that it doesn't get picked up by a parallel job
  - Refresh the seats for that subscription
- The scheduler will queue a new job if there is remaining work and the maximum number of running jobs hasn't already been queued
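The "claim then refresh" step above can be sketched in plain Ruby (in-memory stand-ins; the real version would operate on DB rows with an atomic update):

```ruby
require 'time'

# In-memory stand-in for a GitlabSubscription row (illustrative only).
Subscription = Struct.new(:id, :last_seat_refresh_at)

# Find the next subscription not refreshed in the last 24 hours and stamp
# it immediately so a parallel job won't pick up the same one.
def claim_next(subscriptions, now: Time.now)
  stale = subscriptions.find do |s|
    s.last_seat_refresh_at.nil? || s.last_seat_refresh_at < now - 24 * 3600
  end
  return nil unless stale

  stale.last_seat_refresh_at = now # stamp first, refresh afterwards
  stale
end

subs = [
  Subscription.new(1, Time.now),            # refreshed recently
  Subscription.new(2, Time.now - 48 * 3600) # stale
]
claim_next(subs)&.id # => 2
claim_next(subs)     # => nil (subscription 2 was just stamped)
```

Because the timestamp is written before the refresh, a second job scheduled in parallel sees no remaining stale rows and exits quickly.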
### The MRs
Replacing the existing job involves adding 2 workers and a database change. So to make it easier to review, it’s been split into the following MRs:
| Title | Link | Stage |
|---|---|---|
| Add the required DB column | !103937 (merged) | |
| Add the new LimitedCapacity worker | !104099 (merged) | blocked |
| Add the scheduler worker | !104705 (closed) | blocked |
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.