Track event type that led to queueing online GC tasks
Context
Online GC, as described in the specification, operates on top of two database tables that act as review queues: gc_manifest_review_queue and gc_blob_review_queue, one for manifests and another for blobs, respectively.
There are multiple API events that may lead to dangling artifacts (manifests and blobs), and therefore a review task is queued in response to each of these events.
For example, a manifest may become dangling when a tag is deleted through the API. In this case, a task is queued to check that the manifest the tag was pointing to still has at least one other tag (or another manifest) referencing it. If it does not, it should be deleted.
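The check described above can be sketched in Go as follows. This is only an illustration of the rule, not the registry's actual review logic: the manifestRefs type, its fields, and isDangling are hypothetical names, and the real check runs against the database.

```go
package main

import "fmt"

// manifestRefs is a hypothetical summary of what still references a manifest.
type manifestRefs struct {
	Tags      int // number of tags still pointing at the manifest
	Manifests int // number of other manifests (e.g. manifest lists) referencing it
}

// isDangling reports whether nothing references the manifest any more,
// in which case a review task would conclude it should be deleted.
func isDangling(r manifestRefs) bool {
	return r.Tags == 0 && r.Manifests == 0
}

func main() {
	// One tag was deleted, but another tag still points at the manifest.
	fmt.Println(isDangling(manifestRefs{Tags: 1})) // false

	// The deleted tag was the last reference.
	fmt.Println(isDangling(manifestRefs{})) // true
}
```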
Problem
Right now we have no visibility into which event led to a task being queued. Knowing which event queued a given task (and potentially led to an artifact deletion later) would facilitate debugging and analysis, and would also allow us to collect additional metrics.
Solution
- Add a new event (text) column to the GC queue tables (release N).
- Update GC trigger functions (on the database) to fill this column when inserting (or updating on conflict *) rows in these tables (release N).
- Update GC workers (on the application) to log the type of event for each processed task (release N+1).
- Update integration tests in registry/datastore/gc_integration_test.go so that the value of the new event column is properly validated.
- Update registry_gc_runs_total Prometheus metric to include the event type of each task (release N+1). There are only 7 types of events (more on that later), so cardinality should not be a problem.
- Add NOT NULL constraint to the new event column and update models accordingly.
- Expand Grafana dashboards to include metrics about the new dangling and event labels.
* As described in the specification, some GC triggers have an ON CONFLICT DO UPDATE clause. This is needed because different events may lead to multiple attempts to queue a task for the exact same manifest or blob. Therefore, we will also update the event of existing tasks in case of conflict. This guarantees that the value of event is the latest event that led to queueing a task and not the first one.
Events
Manifests
Here we'll list all GC functions/triggers that are responsible for inserting/updating rows on the gc_manifest_review_queue table (all documented in the specification), as well as the corresponding value to be used for the new event column:
DB Function | Triggering Event Identifier |
---|---|
gc_track_manifest_uploads | manifest_upload |
gc_track_deleted_manifest_lists | manifest_list_delete |
gc_track_deleted_tags | tag_delete |
gc_track_switched_tags | tag_switch |
Blobs
Same but for the gc_blob_review_queue table:
DB Function | Triggering Event Identifier |
---|---|
gc_track_blob_uploads | blob_upload |
gc_track_deleted_manifests | manifest_delete |
gc_track_deleted_layers | layer_delete |
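On the application side, the identifiers from the two tables above could be captured as a fixed set so that tests and workers can validate the column value. The sketch below is only one possible shape. The function and identifier strings come from the tables, while the variable names and the isKnownEvent helper are hypothetical.

```go
package main

import "fmt"

// manifestQueueEvents maps each DB function that writes to
// gc_manifest_review_queue to its event identifier.
var manifestQueueEvents = map[string]string{
	"gc_track_manifest_uploads":       "manifest_upload",
	"gc_track_deleted_manifest_lists": "manifest_list_delete",
	"gc_track_deleted_tags":           "tag_delete",
	"gc_track_switched_tags":          "tag_switch",
}

// blobQueueEvents does the same for gc_blob_review_queue.
var blobQueueEvents = map[string]string{
	"gc_track_blob_uploads":      "blob_upload",
	"gc_track_deleted_manifests": "manifest_delete",
	"gc_track_deleted_layers":    "layer_delete",
}

// isKnownEvent reports whether event is one of the identifiers listed above.
func isKnownEvent(event string) bool {
	for _, events := range []map[string]string{manifestQueueEvents, blobQueueEvents} {
		for _, e := range events {
			if e == event {
				return true
			}
		}
	}
	return false
}

func main() {
	// Total number of distinct event types across both queues.
	fmt.Println(len(manifestQueueEvents) + len(blobQueueEvents)) // 7

	fmt.Println(isKnownEvent("tag_delete"))  // true
	fmt.Println(isKnownEvent("repo_delete")) // false
}
```

Because the set is closed and small, using the identifier as a metric label keeps the number of label values bounded at seven.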