Add limited capacity job to destroy container repositories
## 🗄 Context
Users can host container repositories in their Projects using the GitLab Container Registry.
The modeling can be simplified with:

```mermaid
flowchart LR
p(Project)--1:n--- cr(Container Repository)
cr--1:n--- t(tag)
```
Easy, right? Well, we have a few challenges (simplified):
- `ContainerRepository` data is hosted on the rails backend and the container registry.
- `Tag`, on the other hand, only exists in the container registry.
When we read a container repository on the rails side, we can't know in advance how many tags we have there. To know that, we need to call the container registry API to have the list of tags.
Now, let's say that a user clicks on the destroy button of a container repository on the rails side. We have a few things to do to complete this operation (simplified):
1. Delete all tags.
   - We need to call one `DELETE` endpoint per tag here, as the container registry API doesn't have a delete-tags-in-bulk endpoint (yet).
2. Delete the container repository.
   - We have to call one `DELETE` endpoint in the container registry API.
   - We have to remove the related row from the database.
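The two steps above can be sketched in plain Ruby. `FakeRegistry`, `destroy_container_repository`, and the hash-based repository are illustrative stand-ins for this sketch, not the actual GitLab classes:

```ruby
# Stub standing in for the container registry API client.
class FakeRegistry
  attr_reader :calls

  def initialize
    @calls = []
  end

  def delete_tag(path, tag)
    @calls << [:delete_tag, path, tag]
  end

  def delete_repository(path)
    @calls << [:delete_repository, path]
  end
end

def destroy_container_repository(registry, repo)
  # Step 1: one DELETE request per tag (no bulk delete endpoint yet)
  repo[:tags].each { |tag| registry.delete_tag(repo[:path], tag) }
  repo[:tags].clear

  # Step 2: delete the repository in the registry and drop the DB row
  registry.delete_repository(repo[:path])
  repo[:row_deleted] = true
end
```

Step 1 is the expensive part: the number of network requests grows linearly with the number of tags.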
The above is quite involved, so this operation is delayed to a background worker.
The current worker (`DeleteContainerRepository`) will simply walk through steps (1.) and (2.).
Now, on gitlab.com we have some heavy container repositories (with close to 100 000 tags). Step (1.) will certainly take time. On top of that, (1.) does many network requests (recall that `DELETE` request per tag) to the container registry, which can fail due to restarts, hiccups or other issues. As such, (1.) has a good chance to fail.
The problem with that is that the current implementation ignores some of those failures and still executes (2.).
Another problem is that the worker could be terminated due to a long-running job and would never retry the delete operations. Container repositories will be marked as "pending destruction" in the UI, as we have a `status` field on the container repository to indicate whether a repository is being deleted or not.
In short, (1.) is not reliable and causes quite a few issues. This is issue #217702 (closed).
## 🚑 Limited capacity jobs to the rescue!
The main idea to tackle those problems is to have a job that can be interrupted, killed, stopped, whatever. It doesn't matter much, the delete operation will be resumed.
To implement that, we're going to leverage a limited capacity job. Its responsibility will be quite simple:
1. Take the next pending destruction container repository; exit if none.
2. Loop on tags and delete them while limiting the execution time.
3. If (2.) succeeds, destroy the container repository.
4. Re-enqueue itself (this is automatically done as part of the limited capacity worker).
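As a rough sketch of that loop, with hypothetical names (`DeleteWorkerSketch`, `MAX_TAG_DELETES_PER_RUN`) and hashes in place of ActiveRecord models — the real worker limits by execution time rather than a fixed tag count:

```ruby
# Illustrative sketch of the limited capacity worker's perform_work loop.
class DeleteWorkerSketch
  MAX_TAG_DELETES_PER_RUN = 20 # stand-in for the real time-based limit

  def initialize(repositories)
    @repositories = repositories # array of { tags: [...], status: :... }
  end

  def perform_work
    repo = next_pending_destruction
    return :nothing_to_do unless repo

    repo[:status] = :delete_ongoing # claim it so another job can't pick it up
    repo[:tags].shift(MAX_TAG_DELETES_PER_RUN) # delete a bounded batch of tags

    if repo[:tags].empty?
      repo[:status] = :deleted # (2.) finished: destroy the repository
      :destroyed
    else
      repo[:status] = :delete_scheduled # interrupted: resume on a later run
      :interrupted
    end
  end

  private

  def next_pending_destruction
    @repositories.find { |r| r[:status] == :delete_scheduled }
  end
end
```

Each `perform_work` run either makes progress or exits; the re-enqueueing between runs is what the limited capacity framework provides.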
Now, (2.) can be stopped or interrupted. That's fine: as long as we keep the container repository as pending destruction, the delete operation will be resumed at a later time.
In other words, this job will loop non-stop until all pending destruction container repositories are processed (i.e. removed).
That's nice and cool but how do we kick start the loop?
This will be done with a cron job.
The beauty of this approach is that any web request deleting a container repository doesn't have to enqueue any worker. Marking the container repository as pending destruction is enough. The two jobs guarantee that it will be picked up for processing.
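A tiny sketch of that design point, with a plain-Ruby stand-in for the model (not the real `ContainerRepository`): the web request only flips a status column.

```ruby
# Stand-in model: deleting from the web layer is just a status change.
class RepositorySketch
  attr_reader :status

  def initialize
    @status = :default
  end

  # What the destroy endpoint does: no worker enqueue, just mark it.
  def delete_scheduled!
    @status = :delete_scheduled # the cron + limited capacity jobs take it from here
  end
end

repo = RepositorySketch.new
repo.delete_scheduled!
```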
## ✂ MRs split
The entire change was a bit too big for my taste in a single MR, so I split the work into several MRs:
1. The limited capacity job and database changes. 👈 You're here.
2. The cron job and the feature flag.
3. Feature flag cleanup, along with removing the old approach of destroying container repositories.
## 🔬 What does this MR do and why?
- Database changes
  - Add a new column `delete_started_at` to table `container_repositories`.
- Model changes
  - Add a new status `delete_ongoing` to `ContainerRepository`. This is used to make sure that 2 limited capacity jobs don't pick up the same container repository.
  - Add helper functions to `ContainerRepository` to start and reset the delete phase.
- Background jobs
  - Add the `ContainerRegistry::DeleteContainerRepositoryWorker` job, which picks up the next `delete_scheduled` container repository and starts removing it.
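The start/reset helpers can be pictured like this — a plain-Ruby sketch assuming the new `delete_started_at` column and `delete_ongoing` status; the method names (`start_delete!`, `reset_delete!`) are illustrative, not necessarily the ones in the diff:

```ruby
# Plain-Ruby stand-in for ContainerRepository's delete-phase helpers.
class ContainerRepositorySketch
  attr_reader :status, :delete_started_at

  def initialize
    @status = :delete_scheduled
    @delete_started_at = nil
  end

  # Claim the repository so two limited capacity jobs can't pick it up.
  def start_delete!
    @status = :delete_ongoing
    @delete_started_at = Time.now
  end

  # Put it back in the queue if the job was interrupted.
  def reset_delete!
    @status = :delete_scheduled
    @delete_started_at = nil
  end
end
```

`delete_started_at` also makes it possible to spot repositories whose delete has been "ongoing" for suspiciously long (i.e. a job that died mid-run).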
## 📺 Screenshots or screen recordings

n/a
## ⚙ How to set up and validate locally
1. Have GDK ready with the container registry setup.
2. Set up your `docker` client.
Time to create a container repository with a few tags. Create a `Dockerfile` with:

```dockerfile
FROM scratch
ADD seed /
```
Now, let's create a container repository with many tags:

```shell
for i in {1..100}
do
  docker build -t <registry url>/<project-path>/registry-test:$i .
  docker push <registry url>/<project-path>/registry-test:$i
done
```
I used `100` for the number of tags.
Everything is ready to play around. In a rails console:
1. Get the container repository: `repo = ContainerRepository.last`
2. Check that we have many tags: `repo.tags_count # should be the amount of tags you created`
3. We need to categorize our repository as non migrated: `repo.update!(created_at: ::ContainerRepository::MIGRATION_PHASE_1_ENDED_AT - 1.month)`
   - Basically, the container registry of GDK doesn't handle migrated repositories (yet). As such, we need to make sure that rails treats this as a non migrated repository during the tags cleanup. (Migrated repos with the gitlab container registry use a much more efficient way to delete tags.)
4. Let's mark the repository as `delete_scheduled`: `repo.delete_scheduled!`
5. Now, let's enqueue our limited capacity job: `ContainerRegistry::DeleteContainerRepositoryWorker.perform_with_capacity`
6. In `log/sidekiq.log`, you should see these lines:

```json
{"severity":"INFO","time":"2022-10-25T14:18:50.803Z","retry":0,"queue":"default","backtrace":true,"version":0,"status_expiration":1800,"queue_namespace":"container_repository_delete","class":"ContainerRegistry::DeleteContainerRepositoryWorker","args":[],"jid":"d6585226d7c2127474cc418b","created_at":"2022-10-25T14:18:50.771Z","meta.feature_category":"container_registry","correlation_id":"2bb2bfee97d503f932b200d1232b936f","worker_data_consistency":"always","size_limiter":"validated","enqueued_at":"2022-10-25T14:18:50.800Z","job_size_bytes":2,"pid":86294,"message":"ContainerRegistry::DeleteContainerRepositoryWorker JID-d6585226d7c2127474cc418b: start","job_status":"start","scheduling_latency_s":0.002467}
{"severity":"INFO","time":"2022-10-25T14:18:55.963Z","project_id":303,"container_repository_id":136,"container_repository_path":"root/registry-refacto/test2","tags_size_before_delete":99,"deleted_tags_size":99,"meta.feature_category":"container_registry","correlation_id":"2bb2bfee97d503f932b200d1232b936f","meta.caller_id":"ContainerRegistry::DeleteContainerRepositoryWorker","class":"ContainerRegistry::DeleteContainerRepositoryWorker","job_status":"running","queue":"default","jid":"d6585226d7c2127474cc418b","retry":0}
{"severity":"INFO","time":"2022-10-25T14:18:55.976Z","retry":0,"queue":"default","backtrace":true,"version":0,"status_expiration":1800,"queue_namespace":"container_repository_delete","class":"ContainerRegistry::DeleteContainerRepositoryWorker","args":[],"jid":"d6585226d7c2127474cc418b","created_at":"2022-10-25T14:18:50.771Z","meta.feature_category":"container_registry","correlation_id":"2bb2bfee97d503f932b200d1232b936f","worker_data_consistency":"always","size_limiter":"validated","enqueued_at":"2022-10-25T14:18:50.800Z","job_size_bytes":2,"pid":86294,"message":"ContainerRegistry::DeleteContainerRepositoryWorker JID-d6585226d7c2127474cc418b: done: 5.17304 sec","job_status":"done","scheduling_latency_s":0.002467,"redis_calls":9,"redis_duration_s":0.0023899999999999998,"redis_read_bytes":211,"redis_write_bytes":1393,"redis_cache_calls":1,"redis_cache_duration_s":0.000233,"redis_cache_read_bytes":202,"redis_cache_write_bytes":55,"redis_queues_calls":4,"redis_queues_duration_s":0.001411,"redis_queues_read_bytes":5,"redis_queues_write_bytes":799,"redis_shared_state_calls":4,"redis_shared_state_duration_s":0.000746,"redis_shared_state_read_bytes":4,"redis_shared_state_write_bytes":539,"db_count":9,"db_write_count":5,"db_cached_count":1,"db_replica_count":0,"db_primary_count":9,"db_main_count":9,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":1,"db_main_cached_count":1,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.013,"db_main_duration_s":0.013,"db_main_replica_duration_s":0.0,"cpu_s":0.185157,"worker_id":"sidekiq_0","rate_limiting_gates":[],"duration_s":5.17304,"completed_at":"2022-10-25T14:18:55.976Z","load_balancing_strategy":"primary","db_duration_s":0.00403}
```
7. If we check the UI, the container repository is gone 🎉
## 🏎 MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
## 💾 Database review
### ⤴ Migration up
```shell
$ rails db:migrate
main: == 20221020124018 AddDeleteStartedAtToContainerRepositories: migrating ========
main: -- add_column(:container_repositories, :delete_started_at, :datetime_with_timezone, {:null=>true, :default=>nil})
main:    -> 0.0051s
main: == 20221020124018 AddDeleteStartedAtToContainerRepositories: migrated (0.0056s)
main: == 20221025105205 AddStatusAndIdIndexToContainerRepositories: migrating =======
main: -- transaction_open?()
main:    -> 0.0000s
main: -- index_exists?(:container_repositories, [:status, :id], {:name=>"index_container_repositories_on_status_and_id", :where=>"status IS NOT NULL", :algorithm=>:concurrently})
main:    -> 0.0117s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0003s
main: -- add_index(:container_repositories, [:status, :id], {:name=>"index_container_repositories_on_status_and_id", :where=>"status IS NOT NULL", :algorithm=>:concurrently})
main:    -> 0.0038s
main: -- execute("RESET statement_timeout")
main:    -> 0.0003s
main: == 20221025105205 AddStatusAndIdIndexToContainerRepositories: migrated (0.0231s)
```
### ⤵ Migration down
```shell
$ rails db:rollback
main: == 20221025105205 AddStatusAndIdIndexToContainerRepositories: reverting =======
main: -- transaction_open?()
main:    -> 0.0000s
main: -- index_exists?(:container_repositories, [:status, :id], {:name=>"index_container_repositories_on_status_and_id", :algorithm=>:concurrently})
main:    -> 0.0133s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0004s
main: -- remove_index(:container_repositories, {:name=>"index_container_repositories_on_status_and_id", :algorithm=>:concurrently, :column=>[:status, :id]})
main:    -> 0.0110s
main: -- execute("RESET statement_timeout")
main:    -> 0.0003s
main: == 20221025105205 AddStatusAndIdIndexToContainerRepositories: reverted (0.0333s)
main: == 20221020124018 AddDeleteStartedAtToContainerRepositories: reverting ========
main: -- remove_column(:container_repositories, :delete_started_at, :datetime_with_timezone, {:null=>true, :default=>nil})
main:    -> 0.0164s
main: == 20221020124018 AddDeleteStartedAtToContainerRepositories: reverted (0.0213s)
```
## 🚜 Queries Analysis
- select the next container repository pending destruction.
- count how many container repositories are pending destruction.
We do have single-row updates, but those are the usual `container_repository.update_columns` calls. I didn't run an analysis on these queries; they are standard `UPDATE` queries for a single row selected by primary key.
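For illustration, the two analyzed queries boil down to the following, here simulated over plain hashes instead of the `container_repositories` table (the new `(status, id)` partial index is what keeps both cheap):

```ruby
# Stand-in rows for container_repositories (status is really an enum column).
repos = [
  { id: 1, status: :delete_scheduled },
  { id: 2, status: :default },
  { id: 3, status: :delete_scheduled },
]

# Query 1: select the next container repository pending destruction.
next_repo = repos.select { |r| r[:status] == :delete_scheduled }.min_by { |r| r[:id] }

# Query 2: count how many container repositories are pending destruction.
pending_count = repos.count { |r| r[:status] == :delete_scheduled }
```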