Migrate job artifacts data from `ci_builds` to `ci_job_artifacts` table (with Background migrations)
## Summary of this MR

This merge request migrates job artifacts data from the `ci_builds` table to the `ci_job_artifacts` table. Once this is done, we can simplify the SQL for looking up job artifacts, and effectively fix the broken `ExpireBuildArtifactsWorker`.
## History

Until GitLab 10.3, job artifacts data had been stored in the `ci_builds` table. Every time GitLab Rails received an artifacts blob from a GitLab Runner, the `ci_builds.artifacts_file` and `ci_builds.artifacts_metadata` columns were updated and the actual files were stored in a directory.

But it turned out this architecture was not scalable, because we had to add a new column to `ci_builds` for every new artifact type. Therefore, in GitLab 10.3, we released a new home for job artifacts: `ci_job_artifacts`. This table stores one artifact per row, and since it has a `file_type` column (enum), it's easy to add a new artifact type. We've been using this architecture since 10.3, and so far there have been no problems.
To summarize our current situation, please take a look at the following table.
| Timeline | Table | Path (Example) |
|---|---|---|
| Job artifacts (~ %10.3) | `ci_builds.artifacts_file`, `ci_builds.artifacts_metadata` | `"shared/artifacts/2018_03/12/10/archive.zip"`, `"shared/artifacts/2018_03/12/10/metadata.gzip"` |
| Job artifacts (%10.3 ~ Present) | `ci_job_artifacts.file`, `ci_job_artifacts.file_type` (archive: `1`, metadata: `2`) | `"shared/artifacts/...hash.../100/10/1/archive.zip"`, `"shared/artifacts/...hash.../100/10/2/metadata.gzip"` |
Also, please note that we often refer to the job artifacts from before 10.3 as *legacy job artifacts*.
## Why this MR is necessary

Since 10.3, the new home `ci_job_artifacts` has been working well on our production server. But the legacy artifacts data still resides in the `ci_builds` table. We also found that some SQL queries are slow because they have to `SELECT` from both the `ci_builds` and `ci_job_artifacts` tables. In fact, `ExpireBuildArtifactsWorker` (a cron worker) has been broken on gitlab.com for a few months; it fails every time with statement timeouts. In order to fix these problems, we need to migrate the legacy artifacts to the new home and achieve a single source of truth.
## How this MR migrates

We migrate data from the `ci_builds` table to the `ci_job_artifacts` table. However, this is not simple, because we also have to consider migrating the actual files (blobs).

To understand the problem, here are the columns that persist legacy artifacts information:
- `ci_builds.artifacts_file` ... file name of the artifact file
- `ci_builds.artifacts_file_store` ... file store of the artifact file (the store can be either local file storage or remote object storage)
- `ci_builds.artifacts_size` ... file size of the artifact file
- `ci_builds.artifacts_metadata` ... file name of the artifact metadata file
- `ci_builds.artifacts_metadata_store` ... file store of the artifact metadata file (the store can be either local file storage or remote object storage)
and the actual files were stored in a directory. The directory path convention is `shared/artifacts/YYYY_MM/project_id/build_id/`.
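The legacy convention above can be sketched as a small helper. This is an illustrative reconstruction of the path rule described here, not the actual GitLab implementation; the method name is an assumption.

```ruby
require "date"

# Hypothetical helper reconstructing the legacy artifacts directory
# convention: shared/artifacts/YYYY_MM/project_id/build_id/
def legacy_artifacts_dir(created_at:, project_id:, build_id:)
  format("shared/artifacts/%s/%d/%d",
         created_at.strftime("%Y_%m"), project_id, build_id)
end

dir = legacy_artifacts_dir(created_at: Date.new(2018, 3, 1), project_id: 12, build_id: 10)
# dir == "shared/artifacts/2018_03/12/10"
```

Note that the path encodes nothing about which artifact it is; the file names (`archive.zip`, `metadata.gzip`) carry that role, which is part of why the schema-level `file_type` column in `ci_job_artifacts` is an improvement.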
Migrating data from the above columns is easy; migrating the actual files, however, is difficult, because those files have already been uploaded to object storage (S3) on gitlab.com. In addition, object storage provides no API for moving files as a batch operation. If we moved the files one by one, it would take a very long time. The data transfer cost is also not negligible, as network traffic would go back and forth between GCP and S3. We should avoid this problem, at least for now.
Actually, there is no problem even if we don't migrate the actual files, since they are not stored in the database anyway. Wherever a file lives, we can look it up as long as we persist the information about where it is stored.
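The per-row data migration can be sketched as a pure transformation: one legacy `ci_builds` row yields up to two `ci_job_artifacts` rows. This is a minimal sketch using Ruby hashes, assuming the column and enum names described above; the real background migration does this with SQL, and the helper name is hypothetical.

```ruby
# file_type enum values from the table above
ARCHIVE_TYPE  = 1
METADATA_TYPE = 2

# Hypothetical sketch: map one legacy ci_builds row (as a hash) to the
# ci_job_artifacts rows it should produce, tagged :legacy_path so the
# old directory layout keeps working without moving any blobs.
def legacy_build_to_artifact_rows(build)
  rows = []
  if build[:artifacts_file]
    rows << { job_id: build[:id], project_id: build[:project_id],
              file_type: ARCHIVE_TYPE, file: build[:artifacts_file],
              file_store: build[:artifacts_file_store],
              size: build[:artifacts_size], file_location: :legacy_path }
  end
  if build[:artifacts_metadata]
    rows << { job_id: build[:id], project_id: build[:project_id],
              file_type: METADATA_TYPE, file: build[:artifacts_metadata],
              file_store: build[:artifacts_metadata_store],
              size: nil, # metadata size was never persisted on ci_builds
              file_location: :legacy_path }
  end
  rows
end
```

The `size: nil` for metadata rows is exactly the gap described in the "Fill `size` values" follow-up below.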
So, this MR migrates data in the database only. Here is the expected result after the migration has finished.
| Timeline | Table | Path (Example) |
|---|---|---|
| Job artifacts (~ Present) | `ci_job_artifacts.file`, `ci_job_artifacts.file_type` (archive: `1`, metadata: `2`) | If `file_location` is `:legacy_path`, it uses `"shared/artifacts/2018_03/12/10/archive.zip"`, `"shared/artifacts/2018_03/12/10/metadata.gzip"`. If `file_location` is `:hashed_path`, it uses `"shared/artifacts/...hash.../100/10/1/archive.zip"`, `"shared/artifacts/...hash.../100/10/2/metadata.gzip"`. |
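With `file_location` in place, path resolution becomes a single branch. The sketch below is illustrative only: the method name is an assumption, and the hashed-path segmentation (SHA-256 of the project ID, split into prefixes) is an assumed stand-in for the real hashed layout, which the table above elides as `...hash...`.

```ruby
require "digest"

# Hypothetical resolver: after the migration every row carries file_location,
# so a single lookup can serve both directory layouts without moving blobs.
def artifacts_dir(file_location, created_at:, project_id:, job_id:)
  case file_location
  when :legacy_path
    format("shared/artifacts/%s/%d/%d",
           created_at.strftime("%Y_%m"), project_id, job_id)
  when :hashed_path
    # Assumed hash input and segmentation -- illustrative, not GitLab's exact scheme.
    hash = Digest::SHA256.hexdigest(project_id.to_s)
    format("shared/artifacts/%s/%s/%s/%d", hash[0, 2], hash[2, 2], hash, job_id)
  else
    raise ArgumentError, "unknown file_location: #{file_location}"
  end
end

artifacts_dir(:legacy_path, created_at: Time.utc(2018, 3, 1), project_id: 12, job_id: 10)
# => "shared/artifacts/2018_03/12/10"
```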
## Statistics on gitlab.com

- The number of legacy artifacts ... 3,630,545
- The number of `ci_builds` rows ... 89,972,286
- The number of `ci_job_artifacts` rows ... 77,238,225
- The biggest gap in a single batch range ... Batch: 9538, range: `9603987..9705164` (delta: 101,177)
NOTE: https://gitlab.com/gitlab-com/infrastructure/issues/4212 is the investigation issue
## Future iterations

### Clean up `MigrateLegacyArtifacts` background migrations with `BackgroundMigration.steal`

This MR uses background migrations, so they need to be cleaned up in the next major/minor release, as described in the handbook. During the cleanup phase, the partial index added in this MR will also be removed, as it will no longer be necessary.
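The cleanup step could look roughly like the following post-deployment migration. This is a sketch only, not runnable outside a Rails/GitLab migration context; the class name is an assumption, and `Gitlab::BackgroundMigration.steal` is GitLab's existing helper for running any still-queued jobs of a given background migration inline.

```ruby
# Post-deployment migration sketch (assumed class name and version pin).
class CleanupLegacyArtifactsBackgroundMigration < ActiveRecord::Migration[4.2]
  DOWNTIME = false

  def up
    # Drain any MigrateLegacyArtifacts jobs still sitting in the queue,
    # executing them synchronously, before dropping the legacy columns
    # and the partial index added in this MR.
    Gitlab::BackgroundMigration.steal('MigrateLegacyArtifacts')
  end

  def down
    # No-op: the background migration cannot be meaningfully reverted.
  end
end
```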
### Fill nullified `file_location` on the `ci_job_artifacts` table

This MR adds a new column `file_location` to indicate where the actual files are stored. Ideally, we would add the column with a default value and a `NOT NULL` constraint; however, we don't have a way to do that safely, given that `ci_job_artifacts` is a very big table holding over 50M rows.
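The reason is that, on the PostgreSQL versions in use at the time, adding a column with a default rewrites the whole table under a lock, which is prohibitive at 50M+ rows. Adding it as a nullable column with no default is a metadata-only change. A sketch (class name assumed, not the final implementation):

```ruby
# Migration sketch: nullable, no default, so Postgres does not rewrite
# or exclusively lock the 50M-row ci_job_artifacts table.
class AddFileLocationToCiJobArtifacts < ActiveRecord::Migration[4.2]
  DOWNTIME = false

  def change
    add_column :ci_job_artifacts, :file_location, :integer, limit: 2
  end
end
```

Backfilling the `NULL` values (and only then adding the constraint) is what the follow-up iteration would handle.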
### Fill `size` values of the `ci_job_artifacts` table from object storage

Legacy artifact rows persisted the file size; however, legacy artifact metadata rows did not. This means the `size` of all migrated metadata rows is empty. We need to follow up on this in the issue.
## Does this MR meet the acceptance criteria?

- [ ] Changelog entry added, if necessary
- [ ] Tests added for this feature/bug
- Review
  - [ ] Has been reviewed by Backend
  - [ ] Has been reviewed by Database
- [ ] End-to-end tests pass (`package-and-qa` manual pipeline job)