
Add last_seat_refresh_at to gitlab subscriptions

Vijay Hawoldar requested to merge vij-add-refresh-column into master

What does this MR do and why?

Adds a new column, last_seat_refresh_at, to the gitlab_subscriptions table.

TL;DR: we need it to track the status of updated rows when using a limited capacity worker to refresh seat attributes.

For more info/context, see below.

bin/rails db:migrate
main: == 20221114145103 AddLastSeatRefreshAtToGitlabSubscriptions: migrating ========
main: -- add_column(:gitlab_subscriptions, :last_seat_refresh_at, :datetime_with_timezone)
main:    -> 0.0013s
main: == 20221114145103 AddLastSeatRefreshAtToGitlabSubscriptions: migrated (0.0015s)

ci: == 20221114145103 AddLastSeatRefreshAtToGitlabSubscriptions: migrating ========
ci: -- add_column(:gitlab_subscriptions, :last_seat_refresh_at, :datetime_with_timezone)
ci:    -> 0.0007s
ci: == 20221114145103 AddLastSeatRefreshAtToGitlabSubscriptions: migrated (0.0008s)
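
For reference, a minimal sketch of what the generated migration likely looks like, reconstructed from the output above (the exact base class is an assumption):

# db/migrate/20221114145103_add_last_seat_refresh_at_to_gitlab_subscriptions.rb
class AddLastSeatRefreshAtToGitlabSubscriptions < Gitlab::Database::Migration[2.0]
  def change
    # datetime_with_timezone maps to a timestamptz column
    add_column :gitlab_subscriptions, :last_seat_refresh_at, :datetime_with_timezone
  end
end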

Refs #334903 (closed)

Background

Every subscription for GitLab.com is represented by a GitlabSubscription.

The GitlabSubscription contains 3 key pieces of information:

  1. max_seats_used - the maximum number of billable seats the Namespace has used
  2. seats_in_use - the current number of billable seats the Namespace is using
  3. seats_owed - the number of additional seats the customer needs to pay for (seats used beyond those included in the subscription)
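
For example, assuming a subscription that includes 10 purchased seats: if the namespace peaked at 12 billable users (max_seats_used = 12) and currently uses 11 (seats_in_use = 11), the customer would owe payment for 2 additional seats (seats_owed = 2).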

To keep these attributes up to date, a worker runs every day at midnight UTC that:

  • iterates over every single GitlabSubscription
  • refreshes the seat attributes for each one
  • updates the DB records via a manual SQL UPDATE for performance - one UPDATE query per batch of subscriptions (a rough sketch follows this list)
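
Roughly, the existing daily job does something like the following minimal sketch (the batch size and the refreshed_attributes_for / bulk_update_sql_for helpers are illustrative, not the real implementation):

# Illustrative sketch of the current daily refresh, not the actual worker
GitlabSubscription.find_in_batches(batch_size: 100) do |batch|
  # recalculate max_seats_used, seats_in_use and seats_owed for each subscription
  updated_rows = batch.map { |subscription| refreshed_attributes_for(subscription) }

  # a single manual UPDATE per batch, which bypasses ActiveRecord callbacks
  GitlabSubscription.connection.execute(bulk_update_sql_for(updated_rows))
end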

The Problem

  1. The worker that runs each day has historically been prone to errors due to timeouts (gitlab-com/gl-infra/scalability#1116 (closed))
  2. The existing job is very long running, so it is at risk of being interrupted (e.g. by a pod or process restart), leaving namespaces without updated seat attributes; its run time will also only ever increase as the number of subscriptions on GitLab.com grows
  3. The manual SQL means we bypass any callbacks defined in the model

The Solution

The solution is to replace the single job with limited capacity jobs (see Sidekiq limited capacity worker).

Doing so will allow us to:

  1. Run one quick job per GitlabSubscription
  2. Loop over all GitlabSubscriptions without fear of interruption
  3. Use “normal” update methods and avoid bypassing the regular lifecycle hooks/callbacks

🎉

Recalculating the seat attributes is important for billing and usage statistics, so the plan is to add the new limited capacity worker behind a feature flag (rollout issue), allowing both the old and new jobs to run at the same time initially.

Once we have confirmed the new job is working as expected, we can remove the old job and the feature flag.
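
A minimal sketch of how that feature flag gating could look (the flag name and worker class names here are assumptions for illustration, not the actual implementation):

# Hypothetical cron entry point; names are illustrative only
class GitlabSubscriptions::ScheduleSeatRefreshWorker
  include ApplicationWorker

  def perform
    # keep the old daily job as-is; only run the new path when the flag is enabled
    return unless Feature.enabled?(:gitlab_subscriptions_limited_capacity_seat_refresh)

    # enqueue the LimitedCapacity worker up to its maximum concurrency
    GitlabSubscriptions::RefreshSeatsWorker.perform_with_capacity
  end
end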

How will it work?

The limited capacity setup will essentially do the following:

  1. A cron job will schedule the seat attribute refresh every 6 hours
  2. The refresh worker (sketched after this list) will:
    1. Look for the next GitlabSubscription that has not been refreshed in the last 24 hours
    2. Immediately update the last refreshed timestamp (last_seat_refresh_at) so that it doesn’t get picked up by a parallel job
    3. Refresh the seats for that subscription
  3. The scheduler will queue a new job if there is remaining work and the maximum number of running jobs hasn’t already been queued
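
A minimal sketch of the refresh worker itself under the LimitedCapacity pattern (the class names, the concurrency cap, and the refresh_seat_attributes! helper are assumptions for illustration):

# Illustrative LimitedCapacity worker; not the actual implementation
class GitlabSubscriptions::RefreshSeatsWorker
  include ApplicationWorker
  include LimitedCapacity::Worker

  MAX_RUNNING_JOBS = 10 # assumed cap on concurrently running jobs

  def perform_work
    subscription = next_subscription
    return unless subscription

    # claim the row first so a parallel job doesn't pick it up
    subscription.update_column(:last_seat_refresh_at, Time.current)

    # assumed helper that recalculates and persists the seat attributes
    # via normal model updates (callbacks included)
    subscription.refresh_seat_attributes!
  end

  def remaining_work_count(*)
    refreshable_subscriptions.limit(max_running_jobs + 1).count
  end

  def max_running_jobs
    MAX_RUNNING_JOBS
  end

  private

  def next_subscription
    refreshable_subscriptions.first
  end

  def refreshable_subscriptions
    # subscriptions not refreshed in the last 24 hours
    GitlabSubscription.where('last_seat_refresh_at IS NULL OR last_seat_refresh_at < ?', 24.hours.ago)
  end
end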

The MRs

Replacing the existing job involves adding two workers and a database change. To make it easier to review, the work has been split into the following MRs:

Title                              | Link             | Stage
Add the required DB column         | !103937 (merged) | 👈🏽 you are here
Add the new LimitedCapacity worker | !104099 (merged) | blocked
Add the scheduler worker           | !104705 (closed) | blocked


