Add the packages execute cleanup service
## 🛶 Context
This is an extraction from !89055 (merged) due to a large MR size there.
The Packages Registry works with these core models (simplified):
```mermaid
flowchart LR
    Group -- 1:n --> Project
    Project -- 1:n --> Package
    Package -- 1:n --> PackageFile
    PackageFile -- 1:1 --> os((Object Storage file))
```
For some package formats, we allow republishing a package version. What happens is that we append the package files to the existing package.
With time, some packages can end up with many package files. All these package files take space on Object Storage.
With #346153 (closed), we're kicking off the work for adding cleanup policies for packages. Basically, users will be allowed to define cleanup rules and the backend will regularly execute the policy to remove packages/package files that are not kept by the policy.
In true iteration spirit, the first iteration will have a single rule. Users will be able to define how many duplicated package files (by filename) need to be kept.
Example: for a maven package, a `pom.xml` is uploaded on each publication. If you publish the same version 100 times, you end up with 100 `pom.xml` package files. Users will be able to state that they only want to keep the 10 most recent `pom.xml` files.
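To make the rule concrete, here is a minimal sketch in plain Ruby (not the actual service code; the `PackageFile` struct and `files_to_destroy` helper are hypothetical stand-ins) of what keeping N duplicated package files means: group the files by filename and drop everything but the N most recent per group.

```ruby
# Hypothetical stand-in for a package file row: just an id and a file name.
PackageFile = Struct.new(:id, :file_name)

# 100 publications of the same version => 100 pom.xml package files.
files = (1..100).map { |i| PackageFile.new(i, 'pom.xml') }

# Group files by name; keep the `keep_n` most recent ones (highest ids)
# per group and return the rest, i.e. the files the policy would remove.
def files_to_destroy(files, keep_n)
  files
    .group_by(&:file_name)
    .flat_map { |_name, group| group.sort_by(&:id).reverse.drop(keep_n) }
end

doomed = files_to_destroy(files, 10)
doomed.size # => 90 (the 10 most recent pom.xml files survive)
```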
For this feature, there are several parts:
- The policy model. That's !85918 (merged).
- Expose the policy object through GraphQL. That's !87799 (merged).
- Prepare services for the background job. 👈 This MR.
- The background job that executes the cleanup policies.

This is issue #346153 (closed).
This MR focuses on introducing the service (and preparing the existing ones) for the background job that will be responsible for executing the cleanup policy.
In short, we introduce here the `Packages::Cleanup::ExecutePolicyService`. Before delving into details, a few points to keep in mind:
- Cleanup policies will be executed regularly. For the first iteration, we're aiming for a 12h frequency.
- Given that we're introducing a cleanup rule, the amount of work that the executor will face can be really large. Imagine all the duplicated package files accumulated in a single package.
- As such, we will never try to do the cleanup in one iteration. Instead, we process as many package files as possible. In other words, the `#execute` method of the service is hard limited in time.
- If not all duplicated package files are removed in one iteration, that's fine. The next execution will resume the cleanup.
- Because a cleanup can span multiple executions, what is most important is accuracy: the service should never delete a package file during a "partial" execution that would not also be deleted by a "complete" execution.
Lastly, one word on the destruction of package files. Because they are tied to object storage files, we can't destroy a large set of them in one go. Instead, we mark the package files with status `pending_destruction`, and another background job takes care of them in a slow fashion (deleting them one by one).
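The time-boxed execution described above can be sketched as follows (plain Ruby; the method name and arguments are illustrative, not the real service's API): process work in batches and bail out as soon as the deadline is hit, leaving the remainder for the next scheduled run.

```ruby
# Illustrative deadline-limited batch processing. In the real service the
# "work" would be marking package files as pending_destruction.
def execute_with_deadline(items, batch_size:, deadline:)
  processed = []
  items.each_slice(batch_size) do |batch|
    # Stop before starting a batch if the time budget is exhausted;
    # whatever is left will be picked up by the next scheduled execution.
    return [processed, :timeout] if Time.now > deadline
    processed.concat(batch)
  end
  [processed, :ok]
end

done, status = execute_with_deadline((1..10).to_a, batch_size: 3, deadline: Time.now + 60)
status # => :ok, everything fit within the budget

_, status = execute_with_deadline((1..10).to_a, batch_size: 3, deadline: Time.now - 1)
status # => :timeout, nothing processed; a later run resumes the cleanup
```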
## 🤔 What does this MR do and why?
- Introduce the `Packages::Cleanup::ExecutePolicyService` to support the single rule we have: `keep_n_duplicated_package_files`.
- This service uses an existing one: `Packages::MarkPackageFilesForDestructionService`. However, as stated in the Context, we have some constraints on the execution time here, so this service is upgraded as follows:
  - it supports a deadline. When hit, the service will automatically stop its execution.
  - it supports passing the batch size for the execution. See the database analysis for why we need to do so.
- Add/update the related specs.
These changes are not used at all for now. As described, an upcoming MR will add the background job that will run regularly and call this service.
## 🖼 Screenshots or screen recordings

n/a
## ⚙ How to set up and validate locally
We're going to create a bunch of dummy packages with duplicated package files. We will then create a packages cleanup policy to keep only `1` duplicated package file, meaning only the most recent one will be kept.
Given that the execute service is not called by anything in the codebase (for now), we're going to do this all in the Rails console:
- Follow this to define a `fixture_file_upload` function.
- Let's create 3 packages:

  ```ruby
  project = Project.first
  pkg1 = FactoryBot.create(:generic_package, project: project)
  pkg2 = FactoryBot.create(:generic_package, project: project)
  pkg3 = FactoryBot.create(:generic_package, project: project)
  ```
- Let's add some dummy files:

  ```ruby
  FactoryBot.create(:package_file, :generic, package: pkg1, file_name: 'file_for_pkg1.txt')
  2.times { FactoryBot.create(:package_file, :generic, package: pkg2, file_name: 'file_for_pkg2.txt') }
  3.times { FactoryBot.create(:package_file, :generic, package: pkg3, file_name: 'file_for_pkg3.txt') }
  ```
- Check the created files (check the `status` column):

  ```ruby
  pkg1.reload.package_files
  pkg2.reload.package_files
  pkg3.reload.package_files
  ```
- Create the packages cleanup policy that will keep only `1` duplicated package file:

  ```ruby
  policy = project.packages_cleanup_policy
  policy.update!(keep_n_duplicated_package_files: '1')
  ```
- Let's execute the policy:

  ```ruby
  Packages::Cleanup::ExecutePolicyService.new(policy).execute
  ```
- Let's re-inspect the files:

  ```ruby
  pkg1.reload.package_files
  pkg2.reload.package_files
  pkg3.reload.package_files
  ```
- The most recent package file has `status: 'default'` and all the others have `status: 'pending_destruction'`. That's the expected behavior ✅
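If you want a quick programmatic check of that final state, here is a tiny helper (hypothetical, not part of the MR; the `FileRow` struct below just mimics the two relevant attributes, and in the console you would pass `pkg3.reload.package_files` instead):

```ruby
# Returns true when only the most recent file (highest id) is still
# 'default' and every older duplicate is 'pending_destruction'.
def statuses_ok?(package_files)
  sorted = package_files.sort_by(&:id)
  sorted.last.status == 'default' &&
    sorted[0..-2].all? { |f| f.status == 'pending_destruction' }
end

# Stand-in rows for illustration.
FileRow = Struct.new(:id, :status)
files = [
  FileRow.new(1, 'pending_destruction'),
  FileRow.new(2, 'pending_destruction'),
  FileRow.new(3, 'default')
]
statuses_ok?(files) # => true
```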
## 🚥 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

- [ ] I have evaluated the MR acceptance checklist for this MR.
## 💾 Database review

From a database perspective, we have two nested `.each_batch` loops running and one side query triggered. That gives us 7 queries to analyze (3 per batched loop + 1).
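To visualize the shape of those two nested loops, here is a plain-Ruby stand-in (using `each_slice` in place of ActiveRecord's `each_batch`; the data and batch sizes are made up, not the tuned values from the analysis below):

```ruby
# Packages (parent) mapped to their package files (children).
package_files = {
  pkg1: %w[f1],
  pkg2: %w[f1 f2],
  pkg3: %w[f1 f2 f3]
}

child_batches = []
# Parent loop: walk the packages in batches.
package_files.keys.each_slice(2) do |pkg_batch|
  pkg_batch.each do |pkg|
    # Child loop: walk each package's files in batches.
    package_files[pkg].each_slice(2) do |file_batch|
      child_batches << [pkg, file_batch]
    end
  end
end

child_batches.size # => 4 (1 batch for pkg1, 1 for pkg2, 2 for pkg3)
```

Each batched step of the real `each_batch` issues a few bookkeeping queries per iteration, which is where the "3 per batched loop" count above comes from.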
### ⤴ Migration up

```shell
$ rails db:migrate
main: == 20220617142124 AddIndexOnInstallablePackageFiles: migrating ================
main: -- transaction_open?()
main:    -> 0.0000s
main: -- index_exists?(:packages_package_files, [:package_id, :id, :file_name], {:where=>"(status = 0)", :name=>"idx_pkgs_installable_package_files_on_package_id_id_file_name", :algorithm=>:concurrently})
main:    -> 0.0115s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0004s
main: -- add_index(:packages_package_files, [:package_id, :id, :file_name], {:where=>"(status = 0)", :name=>"idx_pkgs_installable_package_files_on_package_id_id_file_name", :algorithm=>:concurrently})
main:    -> 0.0043s
main: -- execute("RESET statement_timeout")
main:    -> 0.0004s
main: == 20220617142124 AddIndexOnInstallablePackageFiles: migrated (0.0247s) =======
main: == 20220617143228 ReplacePackagesIndexOnProjectIdAndStatus: migrating =========
main: -- transaction_open?()
main:    -> 0.0000s
main: -- index_exists?(:packages_packages, [:project_id, :status, :id], {:name=>"index_packages_packages_on_project_id_and_status_and_id", :algorithm=>:concurrently})
main:    -> 0.0105s
main: -- add_index(:packages_packages, [:project_id, :status, :id], {:name=>"index_packages_packages_on_project_id_and_status_and_id", :algorithm=>:concurrently})
main:    -> 0.0026s
main: -- transaction_open?()
main:    -> 0.0000s
main: -- indexes(:packages_packages)
main:    -> 0.0101s
main: -- remove_index(:packages_packages, {:algorithm=>:concurrently, :name=>"index_packages_packages_on_project_id_and_status"})
main:    -> 0.0026s
main: == 20220617143228 ReplacePackagesIndexOnProjectIdAndStatus: migrated (0.0304s)
```
### ⤵ Migration down

```shell
$ rails db:rollback
main: == 20220617143228 ReplacePackagesIndexOnProjectIdAndStatus: reverting =========
main: -- transaction_open?()
main:    -> 0.0000s
main: -- index_exists?(:packages_packages, [:project_id, :status], {:name=>"index_packages_packages_on_project_id_and_status", :algorithm=>:concurrently})
main:    -> 0.0136s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0004s
main: -- add_index(:packages_packages, [:project_id, :status], {:name=>"index_packages_packages_on_project_id_and_status", :algorithm=>:concurrently})
main:    -> 0.0033s
main: -- execute("RESET statement_timeout")
main:    -> 0.0003s
main: -- transaction_open?()
main:    -> 0.0000s
main: -- indexes(:packages_packages)
main:    -> 0.0095s
main: -- remove_index(:packages_packages, {:algorithm=>:concurrently, :name=>"index_packages_packages_on_project_id_and_status_and_id"})
main:    -> 0.0023s
main: == 20220617143228 ReplacePackagesIndexOnProjectIdAndStatus: reverted (0.0386s)
main: == 20220617142124 AddIndexOnInstallablePackageFiles: reverting ================
main: -- transaction_open?()
main:    -> 0.0000s
main: -- indexes(:packages_package_files)
main:    -> 0.0111s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0004s
main: -- remove_index(:packages_package_files, {:algorithm=>:concurrently, :name=>"idx_pkgs_installable_package_files_on_package_id_id_file_name"})
main:    -> 0.0027s
main: -- execute("RESET statement_timeout")
main:    -> 0.0004s
main: == 20220617142124 AddIndexOnInstallablePackageFiles: reverted (0.0216s) =======
```
### 🔍 Queries analysis

- First (parent) batch loop: !90395 (comment 994422976)
- Side query: !90395 (comment 995638843)
- Second (child) batched loop: !90395 (comment 995687550)
All the queries were analyzed to stay within the performance guidelines. We went quite far into fine-tuning the batch size of each batch loop. As stated in the Context, keep in mind that we don't need to process all duplicated package files in one go. We just need to make sure that our loops are executed in an efficient manner.
Also keep in mind that all these queries are executed within a background job. They will not be triggered in a web request.
### 🚀 Index analysis

- Brand new index: !90395 (comment 996789237)
- Updating an existing one: !90395 (comment 995795562)