Cleanup policies: put the container repository in the unfinished cleanup state in case of an error

🐨 Context

GitLab offers a Container Registry feature where users can host their images and tags. With time, tags are accumulated and those take physical space in object storage.

In order to keep the object storage file system usage in check, we introduced cleanup policies. They are executed at a given frequency. On each execution, they build a list of tags to delete and delete them.
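For reference, a policy is configured per project with a cadence and tag selection rules. The attribute names below follow the documented container expiration policy settings; the hash itself is just an illustration, not the actual record shape:

```ruby
# Illustrative example of the documented cleanup policy attributes.
policy = {
  enabled: true,
  cadence: '1d',                 # how often the policy runs
  keep_n: 10,                    # always keep the 10 most recent tags
  older_than: '90d',             # only delete tags older than 90 days
  name_regex: '.*',              # tags eligible for deletion
  name_regex_keep: 'release-.*'  # tags to always keep
}
```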

Now, container repository tags don't live in the database (there is no Rails model for them); instead, Rails queries the Container Registry directly using a dedicated API. The same goes for the delete operation: it's a DELETE call on that API.
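To make that concrete, here is a minimal sketch of what such a deletion boils down to, using plain Net::HTTP against the standard Docker Registry v2 manifest endpoint. The real code goes through GitLab's registry client, so treat the endpoint and token handling here as illustrative only:

```ruby
require 'net/http'
require 'uri'

# Illustrative sketch: deleting a tag ultimately means a DELETE request
# against the registry API (here the Docker Registry v2 manifest endpoint,
# addressed by digest). GitLab's actual client and auth flow differ.
def delete_manifest(registry_url, repository_path, digest, token)
  uri = URI("#{registry_url}/v2/#{repository_path}/manifests/#{digest}")
  request = Net::HTTP::Delete.new(uri)
  request['Authorization'] = "Bearer #{token}"

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request) # a 202 Accepted response means the deletion was accepted
  end
end
```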

Given the amount of cleanups (internal) that the backend has to process, we put some Application Limits around the background worker. Among other things, we used the limited capacity worker concern. The main idea is that a cleanup policy can take a very long time to execute, so instead of doing it all at once, the worker stops the cleanup and resumes it at a later time.
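As a rough illustration of the limited capacity idea: a worker including the concern declares how many jobs may run in parallel and is re-enqueued while work remains. This is a simplified sketch; the class name is made up and the real cleanup worker has more moving parts:

```ruby
# Simplified sketch of a limited capacity worker: at most `max_running_jobs`
# instances run concurrently, and the concern re-enqueues the worker as long
# as `remaining_work_count` reports pending work.
class ContainerRepositoryCleanupWorker # hypothetical name
  include ApplicationWorker
  include LimitedCapacity::Worker

  MAX_RUNNING_JOBS = 5

  def perform_work
    # pick one container repository with a pending cleanup and process
    # (part of) it, stopping early if it takes too long
  end

  def remaining_work_count
    # number of container repositories still waiting for a cleanup
  end

  def max_running_jobs
    MAX_RUNNING_JOBS
  end
end
```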

As such, a container repository has a cleanup state. Here are the relevant ones for this MR (a sketch of the corresponding state machine follows the list):

  • scheduled: the cleanup is scheduled and will be picked up by workers asap.
  • ongoing: the cleanup is ongoing.
  • unscheduled: the cleanup has fully completed.
  • unfinished: the cleanup is taking so much time that it has to stop.
    • This is to preserve resources and also to avoid heavy repositories (with 10K+ tags to delete) "locking" the background worker.
    • Those cleanups are resumed when the backend has the time/resources for it.
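Roughly, on the model this is a state machine over a cleanup status column. The column, state, and event names below are indicative, not a copy of the actual code:

```ruby
# Indicative sketch of the cleanup state machine on the container
# repository model (column and event names are illustrative).
state_machine :expiration_policy_cleanup_status, initial: :cleanup_unscheduled do
  state :cleanup_scheduled
  state :cleanup_ongoing
  state :cleanup_unfinished

  event :cleanup_schedule do
    transition any => :cleanup_scheduled
  end

  event :cleanup_start do
    transition cleanup_scheduled: :cleanup_ongoing
  end

  event :cleanup_finish do
    transition cleanup_ongoing: :cleanup_unscheduled
  end

  event :cleanup_interrupt do
    transition cleanup_ongoing: :cleanup_unfinished
  end
end
```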

Without going into the details of how a cleanup policy is executed, it has to be noted that we can have a full range of errors, including network errors. When those happen, the errors bubble up and the corresponding worker will stop with an error.

This is something we want to keep. Let's say that the limited capacity is set to 5: it means that, at all times, we have at most 5 workers processing cleanup policies. If an error occurs, it's a good idea to let the worker die so that we put less load on the Container Registry.

The issue is that the container repository will stay in the ongoing state. As such, the backend will not detect it as partially cleaned up and will not attempt to resume the cleanup.
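In other words, the resume path only considers repositories flagged as unfinished, so anything stuck in ongoing is invisible to it. Illustrative query (the column and status names are indicative):

```ruby
# Illustrative: the code resuming partial cleanups only selects repositories
# in the unfinished status, so a repository left in `ongoing` after an error
# is never picked up again.
ContainerRepository.where(expiration_policy_cleanup_status: :cleanup_unfinished)
```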

🔬 What does this MR do?

In https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/container_expiration_policies/cleanup_service.rb, detect erroneous situations and put the container repository in the unfinished state.
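In spirit, the change amounts to rescuing in the service, flipping the repository to unfinished, and re-raising. This is a hypothetical sketch reusing the event names from the state machine sketch above, not the exact code in the MR:

```ruby
# Hypothetical sketch of the idea behind this MR: if the cleanup blows up,
# move the container repository to the unfinished state so a later run can
# pick it up again, then let the error bubble up as before.
def execute
  container_repository.cleanup_start!

  cleanup_tags!

  container_repository.cleanup_finish!
rescue StandardError
  container_repository.cleanup_interrupt! # mark as unfinished
  raise
end
```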

This service is part of the Application Limits we put in place for cleanup policies. These limits are, for now, behind a feature flag while we tweak them for gitlab.com. Having said that, self-managed users can enable the flag and play with those limits.

In addition, this is more Stuff that should Just Work than anything else, so I don't think any changelog entry or documentation change is needed here.

🖼 Screenshots (strongly suggested)

N / A

🔩 Does this MR meet the acceptance criteria?

🍩 Conformity

Availability and Testing

🚓 Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team