Restore optimized newly expired artifacts removal
What does this MR do and why?
This merge request restores the performant Ci::JobArtifact
removal process that was introduced in !72406 (merged). This code is set up to remove expired CI::JobArtifact
records without referencing any other tables, and is therefore able to remove them at a much faster pace.
Prior to gitlab-com/gl-infra/production#5952 (closed), we were expiring artifacts at a rate of 1-3k per execution. So in this merge request, I've added a feature flag ci_artifact_fast_removal_large_loop_limit
that switches the record from 1_000
to 10_000
records per execution of the worker.
In practice, these mean:
-
1_000
: Continue expiring artifacts in a performant fashion, but slow down to a value we know to be well within safe operational bounds. -
10_000
: Expire all newly created artifacts as they expire, but we're setting a speed limit at 10,000 records that we don't expect to hit. If we see a big influx and decide we need to slow down to vacuum the table, this will prevent dead tuples from accumulating at an unanticipated rate.
Note: Since the worker has been turned off for a few weeks, we expect to see the 10_000
record limit hit in the first executions of the worker. This should drop off and return to 1-3k per execution.
How to set up and validate locally
The behavior we want is laid out in the modified spec files. Read and run those to see what we intend the behaviors to be.
Unfortunately the real spirit of what we're achieving here is not demonstrable outside of a production environment, since it involves throttling the write speed to the production ci_job_artifacts
table.
Queries and plans
Here's the full list of queries made by an execution of this worker, with a query plan below for the query that we run on a loop continuously to load records from ci_job_artifacts
. For context, we did this in a sustainable way before and we don't expect performance issues from this code.
SELECT "ci_job_artifacts"."project_id", "ci_job_artifacts"."file_type", "ci_job_artifacts"."size", "ci_job_artifacts"."created_at", "ci_job_artifacts"."updated_at", "ci_job_artifacts"."expire_at", "ci_job_artifacts"."file", "ci_job_artifacts"."file_store", "ci_job_artifacts"."file_sha256", "ci_job_artifacts"."file_format", "ci_job_artifacts"."file_location", "ci_job_artifacts"."id", "ci_job_artifacts"."job_id", "ci_job_artifacts"."locked" FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."expire_at" < '2021-12-10 02:58:18.560624' AND "ci_job_artifacts"."locked" = 0 LIMIT 100
SELECT "projects"."id", "projects"."name", "projects"."path", "projects"."description", "projects"."created_at", "projects"."updated_at", "projects"."creator_id", "projects"."namespace_id", "projects"."last_activity_at", "projects"."import_url", "projects"."visibility_level", "projects"."archived", "projects"."avatar", "projects"."merge_requests_template", "projects"."star_count", "projects"."merge_requests_rebase_enabled", "projects"."import_type", "projects"."import_source", "projects"."approvals_before_merge", "projects"."reset_approvals_on_push", "projects"."merge_requests_ff_only_enabled", "projects"."issues_template", "projects"."mirror", "projects"."mirror_user_id", "projects"."shared_runners_enabled", "projects"."runners_token", "projects"."build_coverage_regex", "projects"."build_allow_git_fetch", "projects"."build_timeout", "projects"."mirror_trigger_builds", "projects"."pending_delete", "projects"."public_builds", "projects"."last_repository_check_failed", "projects"."last_repository_check_at", "projects"."only_allow_merge_if_pipeline_succeeds", "projects"."has_external_issue_tracker", "projects"."repository_storage", "projects"."repository_read_only", "projects"."request_access_enabled", "projects"."has_external_wiki", "projects"."ci_config_path", "projects"."lfs_enabled", "projects"."description_html", "projects"."only_allow_merge_if_all_discussions_are_resolved", "projects"."repository_size_limit", "projects"."printing_merge_request_link_enabled", "projects"."auto_cancel_pending_pipelines", "projects"."service_desk_enabled", "projects"."cached_markdown_version", "projects"."delete_error", "projects"."last_repository_updated_at", "projects"."disable_overriding_approvers_per_merge_request", "projects"."storage_version", "projects"."resolve_outdated_diff_discussions", "projects"."remote_mirror_available_overridden", "projects"."only_mirror_protected_branches", "projects"."pull_mirror_available_overridden", "projects"."jobs_cache_index", "projects"."external_authorization_classification_label", "projects"."mirror_overwrites_diverged_branches", "projects"."pages_https_only", "projects"."external_webhook_token", "projects"."packages_enabled", "projects"."merge_requests_author_approval", "projects"."pool_repository_id", "projects"."runners_token_encrypted", "projects"."bfg_object_map", "projects"."detected_repository_languages", "projects"."merge_requests_disable_committers_approval", "projects"."require_password_to_approve", "projects"."emails_disabled", "projects"."max_pages_size", "projects"."max_artifacts_size", "projects"."remove_source_branch_after_merge", "projects"."marked_for_deletion_at", "projects"."marked_for_deletion_by_user_id", "projects"."autoclose_referenced_issues", "projects"."suggestion_commit_message", "projects"."project_namespace_id" FROM "projects" WHERE "projects"."id" = 78
SELECT "routes".* FROM "routes" WHERE "routes"."source_type" = 'Project' AND "routes"."source_id" = 78
SELECT "project_statistics".* FROM "project_statistics" WHERE "project_statistics"."project_id" = 78
DELETE FROM "security_findings" WHERE "security_findings"."id" IN (SELECT "security_findings"."id" FROM "security_findings" INNER JOIN "security_scans" ON "security_scans"."id" = "security_findings"."scan_id" WHERE "security_scans"."build_id" = 108)
SAVEPOINT active_record_2
INSERT INTO "ci_deleted_objects" ("file_store","store_dir","file","pick_up_at") VALUES (1, '34/9c/349c41201b62db851192665c504b350ff98c6b45fb62a8a2161f78b6534d8de9/2021_12_10/108/1190', 'ci_build_artifacts.zip', '2021-12-09 00:00:00') ON CONFLICT DO NOTHING RETURNING "id"
DELETE FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."id" = 1190
RELEASE SAVEPOINT active_record_2
SELECT 1 AS one FROM "geo_nodes" LIMIT 1
Is one iteration, when there are artifacts to be removed that have associated files in Object Storage. After that, we start from the top with:
SELECT "ci_job_artifacts"."project_id", ... WHERE "ci_job_artifacts"."expire_at" < '2021-12-10 02:58:18.560624' AND "ci_job_artifacts"."locked" = 0 LIMIT 100
either 9 or 99 more times.
Fetching expired unlocked artifacts
Once per loop iteration, we execute the following query. This means it will fire either 10
or 100
times per worker execution, every seven minutes.
SELECT "ci_job_artifacts"."project_id", "ci_job_artifacts"."file_type", "ci_job_artifacts"."size", "ci_job_artifacts"."created_at", "ci_job_artifacts"."updated_at", "ci_job_artifacts"."expire_at", "ci_job_artifacts"."file", "ci_job_artifacts"."file_store", "ci_job_artifacts"."file_sha256", "ci_job_artifacts"."file_format", "ci_job_artifacts"."file_location", "ci_job_artifacts"."id", "ci_job_artifacts"."job_id", "ci_job_artifacts"."locked" FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."expire_at" < '2021-12-10 02:43:56.582765' AND "ci_job_artifacts"."locked" = 0 LIMIT 100
Limit (cost=0.57..103.82 rows=100 width=140) (actual time=6.241..56.041 rows=100 loops=1)
Buffers: shared hit=4 read=40 dirtied=5
I/O Timings: read=54.185 write=0.000
-> Index Scan using ci_job_artifacts_expire_at_unlocked_idx on public.ci_job_artifacts (cost=0.57..27576730.60 rows=26708556 width=140) (actual time=6.239..56.013 rows=100 loops=1)
Index Cond: (ci_job_artifacts.expire_at < '2021-12-10 02:43:56.582765+00'::timestamp with time zone)
Buffers: shared hit=4 read=40 dirtied=5
I/O Timings: read=54.185 write=0.000
Time: 60.483 ms
- planning: 4.370 ms
- execution: 56.113 ms
- I/O read: 54.185 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 4 (~32.00 KiB) from the buffer pool
- reads: 40 (~320.00 KiB) from the OS file cache, including disk I/O
- dirtied: 5 (~40.00 KiB)
- writes: 0
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Deployment:
- Merge this code, deploy to production
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!1393 (merged) and deploy to production workers
- turn
https://gitlab.com/gitlab-org/gitlab/-/issues/348786
on to re-enable the DestroyAllExpiredService #348786 (closed) - Turn
ci_destroy_unlocked_job_artifacts
on to use the more efficient expiration logic onci_job_artifact
records with known locked values. - Observe worker execution with
SMALL_LOOP_LIMIT
: Kibana, Thanos - Turn on
ci_artifact_fast_removal_large_loop_limit
Rollout: #347151 (closed) - Observe worker execution with
LARGE_LOOP_LIMIT
Kibana, Thanos - Move on to merging and deploying !76509 (merged)
Note: The Thanos links are to the incident time window for reference. To observe current behavior, update the time window.