Configurable online GC review delay

Context

As described in the spec, online GC relies on a set of database triggers and functions which take care of queueing blobs and manifests for review.

Problem

By default, the review of these blobs and manifests is set to one day ahead of the queueing time (using the review_after column) and this is not configurable:

| Function/trigger | Insert in queue | `review_after` |
| --- | --- | --- |
| `gc_track_blob_uploads` | `gc_blob_review_queue` | Default (1 day), incremented by another 1 day in case of conflict. |
| `gc_track_deleted_manifests` | `gc_blob_review_queue` | Default (1 day), incremented by another 1 day in case of conflict. |
| `gc_track_deleted_layers` | `gc_blob_review_queue` | Default (1 day), incremented by another 1 day in case of conflict. |
| `gc_track_manifest_uploads` | `gc_manifest_review_queue` | Default (1 day). |
| `gc_track_deleted_manifest_lists` | `gc_manifest_review_queue` | Default (1 day), incremented by another 1 day in case of conflict. |
| `gc_track_deleted_tags` | `gc_manifest_review_queue` | Default (1 day), incremented by another 1 day in case of conflict. |
| `gc_track_switched_tags` | `gc_manifest_review_queue` | Default (1 day), incremented by another 1 day in case of conflict. |

Considering this:

1. As we move through the gradual rollout of the new registry with online GC for GitLab.com, gradually increasing the load on the application, we may find ourselves in need of adjusting the default value for review_after. Frequently changing the default value of review_after (alter table) with a database migration may be problematic.
2. As we gain additional insight on how online GC performs under load, we may also find it useful to fine-tune the default review_after individually for each artifact (blob or manifest) and operation (queued in response to a manifest delete, a tag delete, etc.) pair. Having to drop and recreate the online GC functions to use a non-default value for review_after may be problematic.
3. Last but not least, for QA tests, it would be useful to have no review delay. If we could "disable" it (set the default review_after to NOW()) using an application/environment configuration, we could perform a series of API requests that would let us validate the online GC behavior. For example, we could upload a series of images, and then delete all tags for a few of them. After some seconds, online GC should have removed the dangling images and we could assert that by trying to pull them, which should fail. Currently, there is no way we could do this.

Possible solution

For problems 1 and 2, we could probably have a gc_settings table (similar to how GitLab Rails uses application_settings). There we could have a column for the default delay (either a single one or multiple, one per artifact/operation pair). Each function would then source the proper review_after to use from this table. A regular database migration would fill this table with default values. To change them afterwards we would also need a database migration (a simple update), or:

For problem 3, and as a possible addition for problems 1 and 2 as well, we could read the desired review delay settings from the application configuration file at boot time and update the gc_settings table accordingly (if any custom values were set), or leave it with its default values otherwise. This can create some concurrency problems in clustered environments, as we would have several instances trying to do the same operation on gc_settings. To tackle this we could either use a lock mechanism or wait for a randomized jitter before the update operation, letting the "last write win".

Edited Feb 17, 2021 by João Pereira