Add limits and throttling to Container Expiration Policies Execution
🌲 Context
With the Container Registry, users have a place to host their images and associated tags. Those tags tend to accumulate and, with time, need to be cleaned up. That's where cleanup policies come in. Those policies are executed on a recurring schedule by a cron worker. During its execution, the backend gets all the image repositories and their tags and attempts to delete them.
The main issue here is speed. Deleting tags is a slow operation (for now), as the rails backend has to interact with the container registry.
Currently, cleanup policies are automatically enabled for projects created on 12.8 and after, but they are disabled for projects created before 12.8. Our aim is to enable them for all projects (epic: &2270 (closed)).
In the current implementation, cleanup policies are executed without any kind of Application Limits. Before enabling the policies for all projects, we have to implement those limits and throttle the jobs handled by the background workers so that image repositories with a lot of tags (imagine having to delete 200K tags) don't clog the queue. That's the scope of issue #208193 (closed).
Models view
```mermaid
classDiagram
  Project "1" --> "1" ContainerExpirationPolicy
  Project "1" --> "0..n" ContainerRepository
  ContainerRepository "1" .. "0..n" Tag
```
(Not sure what's happening with the mermaid diagram above; the model names are gigantic.)
The `Tag` is not a model hosted on the rails backend. It lives on the Container Registry.
Note that when the background worker processes a `ContainerRepository`, it has no idea how many tags there are. In other words, we can't know how much work there is by looking at the data in the rails backend. For that, the rails backend has to contact the Container Registry, and this interaction is also a slow operation (for now).
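Since the tag count only becomes known after a network round-trip to the registry, the interaction can be pictured with a tiny stand-in client. This is a sketch: `RegistryClientStub` and `tags_list` are illustrative names, not the actual GitLab client API.

```ruby
# Sketch: the rails backend cannot know how many tags a repository has
# without asking the Container Registry over the network (a slow call).
class RegistryClientStub
  def initialize(tags_by_repo)
    @tags_by_repo = tags_by_repo
  end

  # In the real system this is an HTTP request to the registry, hence slow.
  def tags_list(repository_path)
    @tags_by_repo.fetch(repository_path, [])
  end
end

client = RegistryClientStub.new("group/app" => %w[v1 v2 v3])
client.tags_list("group/app").size # only now do we know the amount of work
```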
Current Worker/Service view
```mermaid
graph TD
  parent([ContainerExpirationPolicyWorker]) --> ceps(ContainerExpirationPolicyService)
  ceps --> ccrw([CleanupContainerRepositoryWorker])
  api[[API::ProjectContainerRepositories<br />API endpoint]] --> ccrw
  ccrw --> services(ServiceLayer for tags deletion)
```
Currently, we have a cron worker (`ContainerExpirationPolicyWorker`) that executes policies on a recurring schedule by sending them to a service (`ContainerExpirationPolicyService`). This service in turn queues as many jobs (`CleanupContainerRepositoryWorker`) as there are image repositories linked to the given policy. This worker (`CleanupContainerRepositoryWorker`) is also used by the API for the delete tags in bulk endpoint. Finally, the worker (`CleanupContainerRepositoryWorker`) accesses the necessary services to delete the tags.
Please note MR !36319 (merged), which adds an execution timeout on the service layer dedicated to tags deletion. This limit allows us to send any number of tags to the service layer without fearing a long execution: the service layer will delete as much as it can and simply stop when the timeout is hit. This timeout is gated behind a feature flag.
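The timeout behavior can be sketched in plain Ruby. All names below are illustrative, not the actual GitLab service-layer code, and the timeout value is a stand-in for the real application setting.

```ruby
# Sketch: delete as many tags as possible until a deadline is hit,
# then report whether the timeout interrupted the work.
def delete_tags_with_timeout(tags, timeout:)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  deleted = []
  tags.each do |tag|
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    # the real implementation calls the Container Registry here (slow)
    deleted << tag
  end
  { deleted: deleted, timeout_hit: deleted.size < tags.size }
end
```

With a generous timeout all tags are deleted; with the deadline already passed, nothing is deleted and `timeout_hit` is `true`, which is exactly the "delete as much as it can and stop" behavior described above.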
🤔 What does this MR do?
The overall idea: we receive some policies to execute without knowing how much work there is to do. Those policies run on a regular basis. Don't try to delete all the tags in one pass; instead, delete as much as possible within the Application Limits.
This MR modifies the two workers so that we can take advantage of the `LimitedCapacity::Worker` concern. This concern imposes a capacity on the number of jobs enqueued. This way, the queue can't be clogged by slow running jobs.
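As a rough illustration of what a capacity limit buys us, here is a plain-Ruby sketch. The method names echo the concern's contract, but this is not the actual `LimitedCapacity::Worker` implementation; `CapacityLimitedRunner` is a hypothetical stand-in.

```ruby
# Sketch: never run more than `capacity` jobs at once; keep going
# in capped batches while work remains, so the queue can't be clogged.
class CapacityLimitedRunner
  attr_reader :max_concurrent_seen

  def initialize(capacity:, work_items:)
    @capacity = capacity
    @queue = work_items.dup
    @max_concurrent_seen = 0
  end

  def run
    until @queue.empty?
      batch = @queue.shift(@capacity) # at most `capacity` jobs in flight
      @max_concurrent_seen = [@max_concurrent_seen, batch.size].max
      batch.each { |item| perform_work(item) }
    end
  end

  def perform_work(item)
    # the real worker cleans up one container repository here
  end
end
```

However many repositories are queued, no more than `capacity` of them are ever processed concurrently.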
We will add a new column on `ContainerRepository` so that we can mark the repository with a cleanup status:
- `cleanup_scheduled`: the linked policy has just been executed. This repository should be picked up as soon as possible.
- `cleanup_unfinished`: a first cleanup pass has been done but the execution timeout has been hit, so the cleaning has been stopped. This repository should be picked up for cleaning if there is no `cleanup_scheduled` repository.
- `cleanup_ongoing`: a worker node is cleaning up this repository.
- `cleanup_unscheduled`: previous cleanups have been completed and there is nothing to do with this repository.
We can see that we have two priorities in those states: a high one (`cleanup_scheduled`), whose repositories should be cleaned up as soon as possible, and a lower one (`cleanup_unfinished`), whose repositories should be processed only when all the high priority cleanings have been done.
The above ensures fairness between all the repositories: a single image repository with thousands and thousands of tags can't monopolize all the production resources for its cleaning.
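The priority rule can be sketched as follows. This is a hash-based stand-in for illustration; the MR itself works with ActiveRecord enum values and scopes rather than plain hashes.

```ruby
# Sketch: pick the next repository to clean, always preferring
# cleanup_scheduled (high priority) over cleanup_unfinished (low priority).
def next_repository_for_cleanup(repositories)
  repositories.find { |r| r[:status] == :cleanup_scheduled } ||
    repositories.find { |r| r[:status] == :cleanup_unfinished }
end

repos = [
  { name: "a", status: :cleanup_unfinished },
  { name: "b", status: :cleanup_scheduled }
]
next_repository_for_cleanup(repos) # picks "b": scheduled beats unfinished
```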
For an additional safety net, we will use a feature flag allowing us to toggle between the old behavior (without limits) and the new one (with limits applied).
⚔ Technical details and design choices
- We need a way to mark a `ContainerRepository` as "ready for cleanup" or "cleaning ongoing". We chose to implement that by adding an `enum` column to the model: `expiration_policy_cleanup_status`.
- We need to include the `LimitedCapacity::Worker` concern in the `CleanupContainerRepositoryWorker`. Our initial idea was to support the feature flag within the class so that the worker class could handle both cases: without the limits and with the limited capacity. This turned out to be a mess, so we decided to create a new worker class for the limited capacity feature: `ContainerExpirationPolicies::CleanupContainerRepositoryWorker`.
- This new worker (`ContainerExpirationPolicies::CleanupContainerRepositoryWorker`) has to manage the cleanup state of the `ContainerRepository` being processed, and it turned out to be complex to handle everything within the same class. So a new service was created, `ContainerExpirationPolicies::CleanupService`, which takes care of managing the cleanup state and passes the work to the proper service layer part.
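The state handling the service performs around the actual deletion can be sketched as below. This is illustrative plain Ruby, not the actual `CleanupService` code; `run_cleanup` and the `:finished`/`:timeout` results are hypothetical names.

```ruby
# Sketch: the service marks the repository ongoing, runs the deletion,
# then records whether the cleanup finished or was cut short by the timeout.
def run_cleanup(repo, &delete_tags)
  repo[:status] = :cleanup_ongoing
  result = delete_tags.call
  repo[:status] = result == :finished ? :cleanup_unscheduled : :cleanup_unfinished
  result
end

repo = { status: :cleanup_scheduled }
run_cleanup(repo) { :finished } # repo ends up :cleanup_unscheduled
```

A timed-out run leaves the repository in `cleanup_unfinished`, so a later, lower-priority pass picks it up again.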
Here is the new Worker/Service view:
```mermaid
graph TD
  parent([ContainerExpirationPolicyWorker]) --feature flag off--> ceps(ContainerExpirationPolicyService)
  parent --feature flag on--> cep_ccrw([ContainerExpirationPolicies::CleanupContainerRepositoryWorker])
  cep_ccrw --> cep_cleanup_service(ContainerExpirationPolicies::CleanupService)
  cep_cleanup_service --> services
  ceps --> ccrw([CleanupContainerRepositoryWorker])
  api[[API::ProjectContainerRepositories<br />API endpoint]] --> ccrw
  ccrw --> services(ServiceLayer for tags deletion)
```
- `ContainerExpirationPolicyWorker` now decides which path to use based on the feature flag.
- `ContainerExpirationPolicies::CleanupContainerRepositoryWorker` uses `ContainerExpirationPolicies::CleanupService` to manage the `expiration_policy_cleanup_status` on the `ContainerRepository`.
- For the actual delete, `ContainerExpirationPolicies::CleanupService` delegates to the exact same part of the service layer as when the feature flag is off.
- The `LimitedCapacity::Worker` concern needs a `capacity`, which defines how many jobs may be enqueued for this work. In this case, we delegate that to an application setting added by this MR: `container_registry_expiration_policies_worker_capacity`.
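How the capacity setting throttles enqueueing can be sketched as follows. `AppSettings` and `jobs_to_enqueue` are illustrative stand-ins for the real application setting and the concern's internals.

```ruby
# Sketch: the capacity comes from an application setting; only the gap
# between the capacity and the currently running jobs may be enqueued.
# A capacity of 0 enqueues nothing, effectively pausing cleanups.
AppSettings = Struct.new(:container_registry_expiration_policies_worker_capacity)

def max_running_jobs(settings)
  settings.container_registry_expiration_policies_worker_capacity
end

def jobs_to_enqueue(settings, running_count)
  [max_running_jobs(settings) - running_count, 0].max
end

jobs_to_enqueue(AppSettings.new(4), 1) # 3 more jobs may be enqueued
```

Because the value is a setting rather than a constant, operators can tune (or zero out) the cleanup throughput without a deploy.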
📚 Other considerations
- Documentation changes are not needed here, as all the changes are gated behind a feature flag.
Screenshots
Here are some logs from my local testing: https://gitlab.com/-/snippets/2020921
Does this MR meet the acceptance criteria?
Conformity
- Changelog entry
- [-] Documentation (if required)
- Code review guidelines
- Merge request performance guidelines
- Style guides
- Database guides
- [-] Separation of EE specific content
Availability and Testing
- [-] Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- [-] Label as security and @ mention @gitlab-com/gl-security/appsec
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team
🗄 Database Review
⬆ Migration Up
```
== 20200920130356 AddContainerExpirationPolicyWorkerSettingsToApplicationSettings: migrating
-- column_exists?(:application_settings, :container_registry_expiration_policies_worker_capacity)
   -> 0.0295s
-- add_column(:application_settings, :container_registry_expiration_policies_worker_capacity, :integer, {:default=>0, :null=>false})
   -> 0.0022s
== 20200920130356 AddContainerExpirationPolicyWorkerSettingsToApplicationSettings: migrated (0.0318s)

== 20200928123510 AddExpirationPolicyCleanupStatusToContainerRepositories: migrating
-- column_exists?(:container_repositories, :expiration_policy_cleanup_status)
   -> 0.0010s
-- add_column(:container_repositories, :expiration_policy_cleanup_status, :integer, {:limit=>2, :default=>0, :null=>false})
   -> 0.0014s
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:expiration_policy_cleanup_status, :expiration_policy_started_at], {:name=>"idx_container_repositories_on_exp_cleanup_status_and_start_date", :algorithm=>:concurrently})
   -> 0.0021s
-- execute("SET statement_timeout TO 0")
   -> 0.0004s
-- add_index(:container_repositories, [:expiration_policy_cleanup_status, :expiration_policy_started_at], {:name=>"idx_container_repositories_on_exp_cleanup_status_and_start_date", :algorithm=>:concurrently})
   -> 0.0025s
-- execute("RESET ALL")
   -> 0.0004s
== 20200928123510 AddExpirationPolicyCleanupStatusToContainerRepositories: migrated (0.0085s)

== 20201012122428 AddContainerRegistryExpirationPoliciesWorkerCapacityConstraint: migrating
-- transaction_open?()
   -> 0.0000s
-- current_schema()
   -> 0.0002s
-- execute("ALTER TABLE application_settings\nADD CONSTRAINT app_settings_registry_exp_policies_worker_capacity_positive\nCHECK ( container_registry_expiration_policies_worker_capacity >= 0 )\nNOT VALID;\n")
   -> 0.0012s
-- current_schema()
   -> 0.0002s
-- execute("ALTER TABLE application_settings VALIDATE CONSTRAINT app_settings_registry_exp_policies_worker_capacity_positive;")
   -> 0.0008s
== 20201012122428 AddContainerRegistryExpirationPoliciesWorkerCapacityConstraint: migrated (0.0108s)

== 20201016074302 AddIndexProjectIdAndIdToContainerRepositories: migrating ====
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:project_id, :id], {:name=>"index_container_repositories_on_project_id_and_id", :algorithm=>:concurrently})
   -> 0.0030s
-- execute("SET statement_timeout TO 0")
   -> 0.0002s
-- add_index(:container_repositories, [:project_id, :id], {:name=>"index_container_repositories_on_project_id_and_id", :algorithm=>:concurrently})
   -> 0.0052s
-- execute("RESET ALL")
   -> 0.0003s
== 20201016074302 AddIndexProjectIdAndIdToContainerRepositories: migrated (0.0091s)
```
⬇ Migration Down
```
== 20201012122428 AddContainerRegistryExpirationPoliciesWorkerCapacityConstraint: reverting
-- execute("ALTER TABLE application_settings\nDROP CONSTRAINT IF EXISTS app_settings_registry_exp_policies_worker_capacity_positive\n")
   -> 0.0008s
== 20201012122428 AddContainerRegistryExpirationPoliciesWorkerCapacityConstraint: reverted (0.0046s)

== 20200928123510 AddExpirationPolicyCleanupStatusToContainerRepositories: reverting
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:expiration_policy_cleanup_status, :expiration_policy_started_at], {:name=>"idx_container_repositories_on_exp_cleanup_status_and_start_date", :algorithm=>:concurrently})
   -> 0.0033s
-- execute("SET statement_timeout TO 0")
   -> 0.0001s
-- remove_index(:container_repositories, {:name=>"idx_container_repositories_on_exp_cleanup_status_and_start_date", :algorithm=>:concurrently, :column=>[:expiration_policy_cleanup_status, :expiration_policy_started_at]})
   -> 0.0066s
-- execute("RESET ALL")
   -> 0.0002s
-- column_exists?(:container_repositories, :expiration_policy_cleanup_status)
   -> 0.0016s
-- remove_column(:container_repositories, :expiration_policy_cleanup_status)
   -> 0.0008s
== 20200928123510 AddExpirationPolicyCleanupStatusToContainerRepositories: reverted (0.0131s)

== 20200920130356 AddContainerExpirationPolicyWorkerSettingsToApplicationSettings: reverting
-- column_exists?(:application_settings, :container_registry_expiration_policies_worker_capacity)
   -> 0.0313s
-- remove_column(:application_settings, :container_registry_expiration_policies_worker_capacity)
   -> 0.0010s
== 20200920130356 AddContainerExpirationPolicyWorkerSettingsToApplicationSettings: reverted (0.0324s)

== 20201016074302 AddIndexProjectIdAndIdToContainerRepositories: reverting ====
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:project_id, :id], {:name=>"index_container_repositories_on_project_id_and_id", :algorithm=>:concurrently})
   -> 0.0038s
-- execute("SET statement_timeout TO 0")
   -> 0.0002s
-- remove_index(:container_repositories, {:name=>"index_container_repositories_on_project_id_and_id", :algorithm=>:concurrently, :column=>[:project_id, :id]})
   -> 0.0052s
-- execute("RESET ALL")
   -> 0.0002s
== 20201016074302 AddIndexProjectIdAndIdToContainerRepositories: reverted (0.0099s)
```
📊 Query plans
See comment !40740 (comment 422135582)
🔒 Database
See comment !40740 (comment 422142011)