
Different fixes for cleanup policies for container images

David Fernandez requested to merge 395749-multiple-fixes into master

🛰 Context

In Reduce CPU utilization on the postgres primary ... (#395749 - closed), we had an issue with cleanup policies for container repositories.

Cleanup policies are essentially background jobs that use parameters to determine which container repository tags can be automatically removed. The issue was an infinite loop being triggered in those background jobs.

We deployed a first fix. It helped, but we still noticed a small loop. This MR aims to fix that remaining issue.

The background classes that execute cleanups heavily rely on the limited capacity concern. At the heart of this concern, we have a job that needs to:

  1. Get the next element to work on.
  2. Work on that element.
  3. Re-enqueue itself if there is more work to do.

From our observations, it seems that we have cleanups that are counted as remaining work in (3.) but never returned in (1.). This creates a loop of no-op executions. That is bad 😱
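
For reference, here is a minimal sketch of that pattern and of the failure mode. This is a simplified pseudo-worker, not the actual GitLab LimitedCapacity::Worker concern:

    # Simplified sketch of a limited-capacity-style worker (illustrative only).
    class CleanupWorker
      def perform
        element = next_element                  # (1.) get the next element
        return unless element

        work_on(element)                        # (2.) work on that element
        re_enqueue if remaining_work_count > 0  # (3.) more work to do? re-enqueue
      end

      private

      def next_element
        # query A: the next container repository eligible for cleanup (or nil)
      end

      def remaining_work_count
        # query B: how many cleanups are still waiting to be executed.
        # If query B matches rows that query A never returns, the worker keeps
        # re-enqueuing itself without ever doing anything: the no-op loop.
        0
      end

      def work_on(element); end

      def re_enqueue; end
    end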

🔬 What does this MR do and why?

We're going to deploy 3 fixes in this MR. The goal is to self-heal those strange stuck cleanups and avoid creating new ones. Here are the fixes:

  1. When evaluating the backlog (1. and 3. above), use the exact same queries. This guarantees that if (3.) finds something, then (1.) will return it (see the sketch after this list).
  2. When a cleanup is started, update the cleanup status and start timestamp in a single UPDATE query.
    • This is to avoid those stuck cleanups.
  3. Update the stale cleanup detection so that the stuck cleanups we currently have on gitlab.com get unstuck.
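
Here is a rough sketch of the idea behind fix (1.). Method and scope names are illustrative, not the exact GitLab code:

    # Both the backlog count and the "next item" lookup derive from the same
    # relation, so if the count finds something, the lookup will return it.
    def remaining_work_count
      next_cleanup_scope.count
    end

    def container_repository
      next_cleanup_scope.first
    end

    def next_cleanup_scope
      # single source of truth for "is there a cleanup to run?"
      ContainerRepository.requiring_cleanup
    end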

🔭 Screenshots or screen recordings

Mmmmh, you know, background jobs, so no UI 🤷

How to set up and validate locally

  1. Have gdk ready with the container registry.
  2. Let's create a dummy container repository with a rails console:
    repository = FactoryBot.create(:container_repository, project: Project.first)
  3. Enable a cleanup policy for that project at <project_url>/-/settings/packages_and_registries/cleanup_image_tags.
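
If you prefer to stay in the console, a hypothetical shortcut (assuming the project already has its default ContainerExpirationPolicy record) could be:

    # Hypothetical console alternative to the UI step above; you may still
    # want to review the policy rules (keep n, older than, ...) in the UI.
    repository.project.container_expiration_policy.update!(enabled: true)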

Let's test the 3 fixes separately.

1️⃣ Same queries for all limited capacity functions

  1. Let's update the repository to create a stuck one:
    repository.update_column(:expiration_policy_cleanup_status, :cleanup_ongoing)
    repository.project.container_expiration_policy.update_column(:next_run_at, 2.minutes.ago)
  2. Let's check if there is work to do in cleanups (that's query 3.)
    ContainerExpirationPolicies::CleanupContainerRepositoryWorker.new.remaining_work_count
    # => 0
    • No work to do
  3. Now, let's check the next repository that will be picked up (that's query 1.)
    ContainerExpirationPolicies::CleanupContainerRepositoryWorker.new.send(:container_repository)
    # => nil
    • No container repository to be picked up

2️⃣ Update cleanup status and start time in the same query

  1. Remove the previous container repository with ContainerRepository.destroy! and recreate one.
    ContainerRepository.last.destroy!
    repository = FactoryBot.create(:container_repository, project: Project.first)
  2. Make sure that the policy is runnable:
    repository.project.container_expiration_policy.update_column(:next_run_at, 2.minutes.ago)
  3. Run the job manually:
    ContainerExpirationPolicies::CleanupContainerRepositoryWorker.new.perform_work
  4. Among the logged SQL queries, the following one should appear:
    UPDATE "container_repositories" SET "updated_at" = '2023-03-14 16:40:19.547469', "expiration_policy_started_at" = '2023-03-14 16:40:19.546074', "expiration_policy_cleanup_status" = 3 WHERE "container_repositories"."id" = 22
    • The status and the start timestamp are updated in a single query
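
For illustration, the Rails-level mechanics behind fix (2.) boil down to writing both columns through a single call. This is a sketch, not the exact code in the worker:

    # Setting both attributes in one update! call produces a single UPDATE
    # statement, so there is no window where the status says "ongoing" while
    # the start timestamp is still missing.
    repository.update!(
      expiration_policy_cleanup_status: :cleanup_ongoing,
      expiration_policy_started_at: Time.zone.now
    )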

3️⃣ Detect stale ongoing cleanups without a started at timestamp

  1. Remove the previous container repository with ContainerRepository.destroy! and recreate one. Make it stuck in ongoing.
    ContainerRepository.last.destroy!
    repository = FactoryBot.create(:container_repository, project: Project.first)
    repository.update_column(:expiration_policy_cleanup_status, :cleanup_ongoing)
  2. We're going to use a cron worker that fixes stale ongoing cleanups and enqueues cleanup jobs. For the purpose of this test, we need to disable the "enqueue cleanup jobs" part.
    • In file app/workers/container_expiration_policy_worker.rb, comment out the ContainerExpirationPolicies::CleanupContainerRepositoryWorker.perform_with_capacity line.
  3. Run the cron worker that detects stale ongoing cleanups and marks them as unfinished:
    ContainerExpirationPolicyWorker.new.perform
  4. Let's check our repository:
    repository.reload
    => #<ContainerRepository:0x000000013b5ffc80
     id: 27,
     project_id: 1,
     name: "test_image_1",
     created_at: Tue, 14 Mar 2023 16:53:13.978734000 UTC +00:00,
     updated_at: Tue, 14 Mar 2023 16:53:13.978734000 UTC +00:00,
     status: nil,
     expiration_policy_started_at: nil,
     expiration_policy_cleanup_status: "cleanup_unfinished",
     expiration_policy_completed_at: nil,
     migration_pre_import_started_at: nil,
     migration_pre_import_done_at: nil,
     migration_import_started_at: nil,
     migration_import_done_at: nil,
     migration_aborted_at: nil,
     migration_skipped_at: nil,
     migration_retries_count: 0,
     migration_skipped_reason: nil,
     migration_state: "default",
     migration_aborted_in_state: nil,
     migration_plan: nil,
     last_cleanup_deleted_tags_count: nil,
     delete_started_at: nil,
     status_updated_at: nil,
     verification_checksum: nil>
    • The repository is put in the unfinished state. It will be picked up by cleanup workers when they have capacity to do so.
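
For reference, the detection added in fix (3.) can be pictured like this (illustrative sketch, not the exact GitLab query):

    # An "ongoing" cleanup with no start timestamp can never be aged out by a
    # timestamp-based check, so it is reset to "unfinished" and will be picked
    # up again by the cleanup workers.
    ContainerRepository
      .where(expiration_policy_cleanup_status: :cleanup_ongoing)
      .where(expiration_policy_started_at: nil)
      .find_each do |repository|
        repository.update!(expiration_policy_cleanup_status: :cleanup_unfinished)
      end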

🚀 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

💾 Database analysis
