Add last_seat_refresh_at to gitlab subscriptions
## What does this MR do and why?
Adds a new column, `last_seat_refresh_at`, to the `gitlab_subscriptions` table.

TL;DR: we need it to track the status of updated rows when using a limited capacity worker to refresh seat attributes. For more info/context, see below.
```shell
$ bin/rails db:migrate
main: == 20221114145103 AddLastSeatRefreshAtToGitlabSubscriptions: migrating ========
main: -- add_column(:gitlab_subscriptions, :last_seat_refresh_at, :datetime_with_timezone)
main:    -> 0.0013s
main: == 20221114145103 AddLastSeatRefreshAtToGitlabSubscriptions: migrated (0.0015s)
ci: == 20221114145103 AddLastSeatRefreshAtToGitlabSubscriptions: migrating ========
ci: -- add_column(:gitlab_subscriptions, :last_seat_refresh_at, :datetime_with_timezone)
ci:    -> 0.0007s
ci: == 20221114145103 AddLastSeatRefreshAtToGitlabSubscriptions: migrated (0.0008s)
```
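Based on the migration output above, the migration file is presumably a one-line `add_column`; a sketch of what it likely looks like (the base migration class and version are assumptions, not confirmed by this MR):

```ruby
# db/migrate/20221114145103_add_last_seat_refresh_at_to_gitlab_subscriptions.rb
# Sketch reconstructed from the migration output above; illustrative only.
class AddLastSeatRefreshAtToGitlabSubscriptions < Gitlab::Database::Migration[2.0]
  def change
    add_column :gitlab_subscriptions, :last_seat_refresh_at, :datetime_with_timezone
  end
end
```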
Refs #334903 (closed)
### Background
Every subscription for GitLab.com is represented by a `GitlabSubscription`. The `GitlabSubscription` contains 3 key pieces of information:
- `max_seats_used` - the maximum number of billable seats the `Namespace` has used
- `seats_in_use` - the current number of billable seats the `Namespace` is using
- `seats_owed` - the number of seats the customer needs to pay for
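As a rough illustration of how these attributes relate (assuming `seats_owed` is the number of seats used beyond what the subscription purchased; the helper below is hypothetical, not code from this MR):

```ruby
# Hypothetical helper: seats owed is how far max_seats_used exceeds the
# number of seats purchased on the subscription (never negative).
def seats_owed(max_seats_used:, purchased_seats:)
  [max_seats_used - purchased_seats, 0].max
end

seats_owed(max_seats_used: 12, purchased_seats: 10) # => 2
seats_owed(max_seats_used: 8,  purchased_seats: 10) # => 0
```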
To keep these attributes up to date, a worker runs every day at midnight UTC that:
- iterates over every single `GitlabSubscription`
- refreshes the seat attributes for each one
- updates the DB records via a manual SQL `UPDATE` to be more performant (one `UPDATE` query for each batch of subscriptions)
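To cover a whole batch with one statement, the SQL presumably uses something like PostgreSQL's `UPDATE ... FROM (VALUES ...)` form. A hypothetical sketch of building such a statement (not the actual query from the worker):

```ruby
# Build one UPDATE statement for a batch of subscriptions using
# PostgreSQL's UPDATE ... FROM (VALUES ...) construct. Illustrative only.
def batch_update_sql(rows)
  values = rows
    .map { |r| "(#{r[:id]}, #{r[:max_seats_used]}, #{r[:seats_in_use]}, #{r[:seats_owed]})" }
    .join(', ')

  <<~SQL
    UPDATE gitlab_subscriptions AS s
    SET max_seats_used = v.max_seats_used,
        seats_in_use   = v.seats_in_use,
        seats_owed     = v.seats_owed
    FROM (VALUES #{values}) AS v(id, max_seats_used, seats_in_use, seats_owed)
    WHERE s.id = v.id
  SQL
end

puts batch_update_sql([
  { id: 1, max_seats_used: 12, seats_in_use: 10, seats_owed: 2 },
  { id: 2, max_seats_used: 5,  seats_in_use: 5,  seats_owed: 0 }
])
```

Because the rows are written directly with raw SQL, none of the ActiveRecord callbacks on the model fire, which is the third problem listed below.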
### The Problem
- The worker that runs each day has historically been prone to error (gitlab-com/gl-infra/scalability#1116 (closed)) due to timeouts
- The existing job is very long running, so it is at risk of being interrupted (e.g. by a pod or process restart), leaving namespaces without updated seat attributes. Its run time will also only ever increase as the number of subscriptions on GitLab.com grows
- The manual SQL means we bypass any callbacks defined in the model
### The Solution
The solution is to replace the single job with Limited Capacity jobs (see the Sidekiq limited capacity worker documentation).
Doing so will allow us to have:
- One quick-running job per `GitlabSubscription`
- A loop over all `GitlabSubscription`s without fear of interruption
- "Normal" update methods that don't bypass the regular lifecycle hooks/callbacks
Recalculating the seat attributes is important for billing and usage statistics, so the plan is to add the new limited capacity worker behind a feature flag (rollout issue) so that we can have both running at the same time initially.
Once we have confirmed the new job is working as expected, we can remove the old job and the feature flag.
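In GitLab, a limited capacity worker includes the `LimitedCapacity::Worker` concern and implements `perform_work`, `remaining_work_count`, and `max_running_jobs`. A rough sketch of what the new worker could look like (the class name, capacity value, and `refresh_seat_attributes!` call are assumptions, not the actual code from the follow-up MRs):

```ruby
# Illustrative sketch only; names and values are assumptions.
class GitlabSubscriptions::RefreshSeatsWorker
  include ApplicationWorker
  include LimitedCapacity::Worker

  MAX_RUNNING_JOBS = 20 # assumed capacity

  def perform_work
    subscription = next_stale_subscription
    return unless subscription

    # Stamp first so parallel jobs skip this row, then refresh.
    subscription.update!(last_seat_refresh_at: Time.current)
    subscription.refresh_seat_attributes! # hypothetical refresh method
  end

  def remaining_work_count(*)
    next_stale_subscription ? 1 : 0
  end

  def max_running_jobs
    MAX_RUNNING_JOBS
  end

  private

  def next_stale_subscription
    GitlabSubscription
      .where('last_seat_refresh_at IS NULL OR last_seat_refresh_at < ?', 24.hours.ago)
      .first
  end
end
```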
### How will it work?
The limited capacity setup will essentially do the following:
- A cron job will schedule the seat attribute refresh every 6 hours
- The refresh worker will:
  - Look for the next `GitlabSubscription` that has not been refreshed in the last 24 hours
  - Immediately update the last refreshed timestamp (`last_seat_refresh_at`) so that it doesn't get picked up by a parallel job
  - Refresh the seats for that subscription
- The scheduler will queue a new job if there is remaining work and the maximum number of running jobs hasn't already been queued
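The "claim then refresh" step above can be sketched in plain Ruby (in-memory stand-ins; the real version would operate on DB rows with an atomic update):

```ruby
require 'time'

# In-memory stand-in for a GitlabSubscription row (illustrative only).
Subscription = Struct.new(:id, :last_seat_refresh_at)

# Find the next subscription not refreshed in the last 24 hours and stamp
# it immediately so a parallel job won't pick up the same one.
def claim_next(subscriptions, now: Time.now)
  stale = subscriptions.find do |s|
    s.last_seat_refresh_at.nil? || s.last_seat_refresh_at < now - 24 * 3600
  end
  return nil unless stale

  stale.last_seat_refresh_at = now # stamp first, refresh afterwards
  stale
end

subs = [
  Subscription.new(1, Time.now),            # refreshed recently
  Subscription.new(2, Time.now - 48 * 3600) # stale
]
claim_next(subs)&.id # => 2
claim_next(subs)     # => nil (subscription 2 was just stamped)
```

Because the timestamp is written before the refresh, a second job scheduled in parallel sees no remaining stale rows and exits quickly.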
### The MRs
Replacing the existing job involves adding 2 workers and a database change. So to make it easier to review, it’s been split into the following MRs:
| Title | Link | Stage |
|---|---|---|
| Add the required DB column | !103937 (merged) | |
| Add the new LimitedCapacity worker | !104099 (merged) | blocked |
| Add the scheduler worker | !104705 (closed) | blocked |
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.