Add limits and throttling to Container Expiration Policies Execution
🌲 Context
With the Container Registry, users have a place to host their images and associated tags. Those tags tend to accumulate and, with time, need to be cleaned up. That's where cleanup policies come in. Those policies are executed on a recurring schedule by a cron worker. During its execution, the backend gets all the image repositories and their tags and attempts to delete them.
The main issue here is speed. Deleting tags is a slow operation (for now), as the rails backend has to interact with the container registry.
Currently, cleanup policies are automatically enabled for projects created on 12.8 and after, but they are disabled for projects created before 12.8. Our aim is to enable them for all projects (epic: &2270 (closed)).
In the current implementation, cleanup policies are executed without any kind of Application Limits. Before enabling the policies for all projects, we have to implement those limits and throttle the jobs handled by the background workers so that image repositories with a lot of tags (imagine having to delete 200K tags) don't clog the queue. That's the scope of issue #208193 (closed).
Models view
```mermaid
classDiagram
  Project "1" --> "1" ContainerExpirationPolicy
  Project "1" --> "0..n" ContainerRepository
  ContainerRepository "1" .. "0..n" Tag
```
(Not sure what's happening with the mermaid diagram above; the model names are gigantic.)
The `Tag` is not a model hosted on the rails backend. It lives on the Container Registry.
Note that when the background worker processes a `ContainerRepository`, it has no idea how many tags there are. In other words, we can't know how much work there is by looking at the data in the rails backend. For that, the rails backend has to contact the Container Registry, and this interaction is also a slow operation (for now).
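Since the tag count only becomes known after a network round-trip to the registry, the interaction can be pictured with a tiny stand-in client. This is a sketch: `RegistryClientStub` and `tags_list` are illustrative names, not the actual GitLab client API.

```ruby
# Sketch: the rails backend cannot know how many tags a repository has
# without asking the Container Registry over the network (a slow call).
class RegistryClientStub
  def initialize(tags_by_repo)
    @tags_by_repo = tags_by_repo
  end

  # In the real system this is an HTTP request to the registry, hence slow.
  def tags_list(repository_path)
    @tags_by_repo.fetch(repository_path, [])
  end
end

client = RegistryClientStub.new("group/app" => %w[v1 v2 v3])
client.tags_list("group/app").size # only now do we know the amount of work
```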
Current Worker/Service view
```mermaid
graph TD
  parent([ContainerExpirationPolicyWorker]) --> ceps(ContainerExpirationPolicyService)
  ceps --> ccrw([CleanupContainerRepositoryWorker])
  api[[API::ProjectContainerRepositories<br />API endpoint]] --> ccrw
  ccrw --> services(ServiceLayer for tags deletion)
```
Currently, we have a cron worker (`ContainerExpirationPolicyWorker`) that executes policies on a recurring schedule by sending them to a service (`ContainerExpirationPolicyService`). This service in turn queues as many jobs (`CleanupContainerRepositoryWorker`) as there are image repositories linked to the given policy. This worker (`CleanupContainerRepositoryWorker`) is also used by the API for the delete tags in bulk endpoint. Finally, the worker (`CleanupContainerRepositoryWorker`) accesses the necessary services to delete the tags.
Please note MR !36319 (merged), which adds an execution timeout on the service layer dedicated to tags deletion. This limit allows us to send any number of tags to the service layer without fearing a long execution: the service layer will delete as much as it can and simply stop when the timeout is hit. This timeout is gated behind a feature flag.
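The timeout behavior can be sketched in plain Ruby. All names below are illustrative, not the actual GitLab service-layer code, and the timeout value is a stand-in for the real application setting.

```ruby
# Sketch: delete as many tags as possible until a deadline is hit,
# then report whether the timeout interrupted the work.
def delete_tags_with_timeout(tags, timeout:)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  deleted = []
  tags.each do |tag|
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    # the real implementation calls the Container Registry here (slow)
    deleted << tag
  end
  { deleted: deleted, timeout_hit: deleted.size < tags.size }
end
```

With a generous timeout all tags are deleted; with the deadline already passed, nothing is deleted and `timeout_hit` is `true`, which is exactly the "delete as much as it can and stop" behavior described above.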
🤔 What does this MR do?
The overall idea: we receive some policies to execute without knowing how much work there is to do. Those policies run on a regular basis. Don't try to delete all the tags in one pass; instead, delete as much as possible within the Application Limits.
This MR modifies the two workers so that we can take advantage of the `LimitedCapacity::Worker` concern. This concern imposes a capacity on the number of jobs enqueued. This way, the queue can't be clogged by slow running jobs.
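As a rough illustration of what a capacity limit buys us, here is a plain-Ruby sketch. The method names echo the concern's contract, but this is not the actual `LimitedCapacity::Worker` implementation; `CapacityLimitedRunner` is a hypothetical stand-in.

```ruby
# Sketch: never run more than `capacity` jobs at once; keep going
# in capped batches while work remains, so the queue can't be clogged.
class CapacityLimitedRunner
  attr_reader :max_concurrent_seen

  def initialize(capacity:, work_items:)
    @capacity = capacity
    @queue = work_items.dup
    @max_concurrent_seen = 0
  end

  def run
    until @queue.empty?
      batch = @queue.shift(@capacity) # at most `capacity` jobs in flight
      @max_concurrent_seen = [@max_concurrent_seen, batch.size].max
      batch.each { |item| perform_work(item) }
    end
  end

  def perform_work(item)
    # the real worker cleans up one container repository here
  end
end
```

However many repositories are queued, no more than `capacity` of them are ever processed concurrently.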
We will add a new column on `ContainerRepository` so that we can mark the repository with a cleanup status:
- `cleanup_scheduled`: the linked policy has just been executed. This repository should be picked up as soon as possible.
- `cleanup_unfinished`: a first cleanup pass has been done but the execution timeout has been hit, so the cleaning has been stopped. This repository should be picked up for cleaning if there is no `cleanup_scheduled` repository.
- `cleanup_ongoing`: a worker node is cleaning up this repository.
- `cleanup_unscheduled`: previous cleanups have been completed and there is nothing to do with this repository.
We can see that we have two priorities in those states: a high one (`cleanup_scheduled`), whose repositories should be cleaned up as soon as possible, and a lower one (`cleanup_unfinished`), whose repositories should be processed only when all the high priority cleanings have been done.
The above ensures fairness between all the repositories: a single image repository with thousands and thousands of tags can't monopolize all the production resources for its cleaning.
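The priority rule can be sketched as follows. This is a hash-based stand-in for illustration; the MR itself works with ActiveRecord enum values and scopes rather than plain hashes.

```ruby
# Sketch: pick the next repository to clean, always preferring
# cleanup_scheduled (high priority) over cleanup_unfinished (low priority).
def next_repository_for_cleanup(repositories)
  repositories.find { |r| r[:status] == :cleanup_scheduled } ||
    repositories.find { |r| r[:status] == :cleanup_unfinished }
end

repos = [
  { name: "a", status: :cleanup_unfinished },
  { name: "b", status: :cleanup_scheduled }
]
next_repository_for_cleanup(repos) # picks "b": scheduled beats unfinished
```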
For an additional safety net, we will use a feature flag allowing us to toggle between the old behavior (without limits) and the new one (with limits applied).
⚔ Technical details and design choices
- We need a way to mark a `ContainerRepository` as "ready for cleanup" or "cleaning ongoing". We chose to implement that by adding an `enum` column to the model: `expiration_policy_cleanup_status`.
- We need to include the `LimitedCapacity::Worker` concern in the `CleanupContainerRepositoryWorker`. Our initial idea was to support the feature flag within the class so that the worker class could handle both cases: without the limits and with the limited capacity. This turned out to be a mess, so we decided to create a new worker class for the limited capacity feature: `ContainerExpirationPolicies::CleanupContainerRepositoryWorker`.
- This new worker (`ContainerExpirationPolicies::CleanupContainerRepositoryWorker`) has to manage the cleanup state of the `ContainerRepository` being processed, and it turned out to be complex to handle everything within the same class. So a new service was created, `ContainerExpirationPolicies::CleanupService`, which takes care of managing the cleanup state and passes the work to the proper service layer part.
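The state handling the service performs around the actual deletion can be sketched as below. This is illustrative plain Ruby, not the actual `CleanupService` code; `run_cleanup` and the `:finished`/`:timeout` results are hypothetical names.

```ruby
# Sketch: the service marks the repository ongoing, runs the deletion,
# then records whether the cleanup finished or was cut short by the timeout.
def run_cleanup(repo, &delete_tags)
  repo[:status] = :cleanup_ongoing
  result = delete_tags.call
  repo[:status] = result == :finished ? :cleanup_unscheduled : :cleanup_unfinished
  result
end

repo = { status: :cleanup_scheduled }
run_cleanup(repo) { :finished } # repo ends up :cleanup_unscheduled
```

A timed-out run leaves the repository in `cleanup_unfinished`, so a later, lower-priority pass picks it up again.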
Here is the new Worker/Service view:
```mermaid
graph TD
  parent([ContainerExpirationPolicyWorker]) --feature flag off--> ceps(ContainerExpirationPolicyService)
  parent --feature flag on--> cep_ccrw([ContainerExpirationPolicies::CleanupContainerRepositoryWorker])
  cep_ccrw --> cep_cleanup_service(ContainerExpirationPolicies::CleanupService)
  cep_cleanup_service --> services
  ceps --> ccrw([CleanupContainerRepositoryWorker])
  api[[API::ProjectContainerRepositories<br />API endpoint]] --> ccrw
  ccrw --> services(ServiceLayer for tags deletion)
```
- `ContainerExpirationPolicyWorker` now decides which path to use based on the feature flag.
- `ContainerExpirationPolicies::CleanupContainerRepositoryWorker` uses `ContainerExpirationPolicies::CleanupService` to manage the `expiration_policy_cleanup_status` on the `ContainerRepository`.
- For the actual delete, `ContainerExpirationPolicies::CleanupService` delegates to the exact same part of the service layer as when the feature flag is off.
- The `LimitedCapacity::Worker` concern needs a `capacity`, which defines how many jobs may be enqueued for this work. In this case, we delegate that to an application setting added by this MR: `container_registry_expiration_policies_worker_capacity`.
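How the capacity setting throttles enqueueing can be sketched as follows. `AppSettings` and `jobs_to_enqueue` are illustrative stand-ins for the real application setting and the concern's internals.

```ruby
# Sketch: the capacity comes from an application setting; only the gap
# between the capacity and the currently running jobs may be enqueued.
# A capacity of 0 enqueues nothing, effectively pausing cleanups.
AppSettings = Struct.new(:container_registry_expiration_policies_worker_capacity)

def max_running_jobs(settings)
  settings.container_registry_expiration_policies_worker_capacity
end

def jobs_to_enqueue(settings, running_count)
  [max_running_jobs(settings) - running_count, 0].max
end

jobs_to_enqueue(AppSettings.new(4), 1) # 3 more jobs may be enqueued
```

Because the value is a setting rather than a constant, operators can tune (or zero out) the cleanup throughput without a deploy.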
📚 Other considerations
- Documentation changes are not needed here, as all the changes are gated behind a feature flag.
Screenshots
Here are some logs from my local testing: https://gitlab.com/-/snippets/2020921
Does this MR meet the acceptance criteria?
Conformity
- Changelog entry
- [-] Documentation (if required)
- Code review guidelines
- Merge request performance guidelines
- Style guides
- Database guides
- [-] Separation of EE specific content
Availability and Testing
- [-] Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- [-] Label as security and @ mention @gitlab-com/gl-security/appsec
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team
🗄 Database Review
⬆ Migration Up
```
== 20200920130356 AddContainerExpirationPolicyWorkerSettingsToApplicationSettings: migrating
-- column_exists?(:application_settings, :container_registry_expiration_policies_worker_capacity)
   -> 0.0295s
-- add_column(:application_settings, :container_registry_expiration_policies_worker_capacity, :integer, {:default=>0, :null=>false})
   -> 0.0022s
== 20200920130356 AddContainerExpirationPolicyWorkerSettingsToApplicationSettings: migrated (0.0318s)

== 20200928123510 AddExpirationPolicyCleanupStatusToContainerRepositories: migrating
-- column_exists?(:container_repositories, :expiration_policy_cleanup_status)
   -> 0.0010s
-- add_column(:container_repositories, :expiration_policy_cleanup_status, :integer, {:limit=>2, :default=>0, :null=>false})
   -> 0.0014s
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:expiration_policy_cleanup_status, :expiration_policy_started_at], {:name=>"idx_container_repositories_on_exp_cleanup_status_and_start_date", :algorithm=>:concurrently})
   -> 0.0021s
-- execute("SET statement_timeout TO 0")
   -> 0.0004s
-- add_index(:container_repositories, [:expiration_policy_cleanup_status, :expiration_policy_started_at], {:name=>"idx_container_repositories_on_exp_cleanup_status_and_start_date", :algorithm=>:concurrently})
   -> 0.0025s
-- execute("RESET ALL")
   -> 0.0004s
== 20200928123510 AddExpirationPolicyCleanupStatusToContainerRepositories: migrated (0.0085s)

== 20201012122428 AddContainerRegistryExpirationPoliciesWorkerCapacityConstraint: migrating
-- transaction_open?()
   -> 0.0000s
-- current_schema()
   -> 0.0002s
-- execute("ALTER TABLE application_settings\nADD CONSTRAINT app_settings_registry_exp_policies_worker_capacity_positive\nCHECK ( container_registry_expiration_policies_worker_capacity >= 0 )\nNOT VALID;\n")
   -> 0.0012s
-- current_schema()
   -> 0.0002s
-- execute("ALTER TABLE application_settings VALIDATE CONSTRAINT app_settings_registry_exp_policies_worker_capacity_positive;")
   -> 0.0008s
== 20201012122428 AddContainerRegistryExpirationPoliciesWorkerCapacityConstraint: migrated (0.0108s)

== 20201016074302 AddIndexProjectIdAndIdToContainerRepositories: migrating ====
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:project_id, :id], {:name=>"index_container_repositories_on_project_id_and_id", :algorithm=>:concurrently})
   -> 0.0030s
-- execute("SET statement_timeout TO 0")
   -> 0.0002s
-- add_index(:container_repositories, [:project_id, :id], {:name=>"index_container_repositories_on_project_id_and_id", :algorithm=>:concurrently})
   -> 0.0052s
-- execute("RESET ALL")
   -> 0.0003s
== 20201016074302 AddIndexProjectIdAndIdToContainerRepositories: migrated (0.0091s)
```
⬇ Migration Down
```
== 20201012122428 AddContainerRegistryExpirationPoliciesWorkerCapacityConstraint: reverting
-- execute("ALTER TABLE application_settings\nDROP CONSTRAINT IF EXISTS app_settings_registry_exp_policies_worker_capacity_positive\n")
   -> 0.0008s
== 20201012122428 AddContainerRegistryExpirationPoliciesWorkerCapacityConstraint: reverted (0.0046s)

== 20200928123510 AddExpirationPolicyCleanupStatusToContainerRepositories: reverting
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:expiration_policy_cleanup_status, :expiration_policy_started_at], {:name=>"idx_container_repositories_on_exp_cleanup_status_and_start_date", :algorithm=>:concurrently})
   -> 0.0033s
-- execute("SET statement_timeout TO 0")
   -> 0.0001s
-- remove_index(:container_repositories, {:name=>"idx_container_repositories_on_exp_cleanup_status_and_start_date", :algorithm=>:concurrently, :column=>[:expiration_policy_cleanup_status, :expiration_policy_started_at]})
   -> 0.0066s
-- execute("RESET ALL")
   -> 0.0002s
-- column_exists?(:container_repositories, :expiration_policy_cleanup_status)
   -> 0.0016s
-- remove_column(:container_repositories, :expiration_policy_cleanup_status)
   -> 0.0008s
== 20200928123510 AddExpirationPolicyCleanupStatusToContainerRepositories: reverted (0.0131s)

== 20200920130356 AddContainerExpirationPolicyWorkerSettingsToApplicationSettings: reverting
-- column_exists?(:application_settings, :container_registry_expiration_policies_worker_capacity)
   -> 0.0313s
-- remove_column(:application_settings, :container_registry_expiration_policies_worker_capacity)
   -> 0.0010s
== 20200920130356 AddContainerExpirationPolicyWorkerSettingsToApplicationSettings: reverted (0.0324s)

== 20201016074302 AddIndexProjectIdAndIdToContainerRepositories: reverting ====
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:project_id, :id], {:name=>"index_container_repositories_on_project_id_and_id", :algorithm=>:concurrently})
   -> 0.0038s
-- execute("SET statement_timeout TO 0")
   -> 0.0002s
-- remove_index(:container_repositories, {:name=>"index_container_repositories_on_project_id_and_id", :algorithm=>:concurrently, :column=>[:project_id, :id]})
   -> 0.0052s
-- execute("RESET ALL")
   -> 0.0002s
== 20201016074302 AddIndexProjectIdAndIdToContainerRepositories: reverted (0.0099s)
```
📊 Query plans
See comment !40740 (comment 422135582)
🔒 Database
See comment !40740 (comment 422142011)