Draft: Prototype to parallelize project export worker
Context
For context, here is some information on how the export currently works. It was added as a separate section because it's a lot of information. 😄
Before discussing how to parallelize the ProjectExportWorker, here is some information on the feature as currently implemented.
ProjectExportWorker is a single-threaded process responsible for exporting a project into a tarball file. Underneath, it uses Projects::ImportExport::ExportService, which does most of the work of exporting the project.
An example of the content of the file can be seen below:
|-tree
| |-project.json
| |-project
| | |-protected_environments.ndjson
| | |-boards.ndjson
| | |-project_members.ndjson
| | |-auto_devops.ndjson
| | |-ci_pipelines.ndjson
| | |-merge_requests.ndjson
| | |-releases.ndjson
| | |-protected_branches.ndjson
| | |-prometheus_metrics.ndjson
| | |-project_badges.ndjson
| | |-custom_attributes.ndjson
| | |-milestones.ndjson
| | |-metrics_setting.ndjson
| | |-push_rule.ndjson
| | |-labels.ndjson
| | |-snippets.ndjson
| | |-external_pull_requests.ndjson
| | |-issues.ndjson
| | |-protected_tags.ndjson
| | |-container_expiration_policy.ndjson
| | |-pipeline_schedules.ndjson
| | |-ci_cd_settings.ndjson
| | |-security_setting.ndjson
| | |-error_tracking_setting.ndjson
| | |-service_desk_setting.ndjson
| | |-project_feature.ndjson
|-snippets
| |-59802ab7b9610c444d6231da4c1bcdc0d54af594525476219b33f957f52939c2.bundle
| |-5e37af85d49fb5656f55c62510c4a5d47b01fc6bfcddf4c6e3ffedc25640d291.bundle
|-GITLAB_VERSION
|-project.wiki.bundle
|-uploads
| |-e5ed7bc6b1fecb851ed7b120fcfe8a15
| | |-image.jpeg
|-lfs-objects.json
|-lfs-objects
| |-2cb698f3a725e42800172f662a10de64e26fbc4425ad871609472add43c77ffc
| |-a804a6aec96c1f2a7db1c5e22c658405343272da1069ae26ce34a3cc39d83130
| |-89dc431c9e6e503c46b6b7285f2f5542b03e46f4dd1e11cd73cf10c812f7321c
| |-d262f804319ceb22ec80430141b46dba3f57c4b9f87afa65b3fea377ede7e76e
|-VERSION
|-project.bundle
|-GITLAB_REVISION
|-project.design.bundle
The export process can be triggered in different ways:
- ProjectsController#export: This action is triggered via the UI when the user clicks the button to export the project, which enqueues the ProjectExportWorker.
- API::ProjectExport#export: This action is triggered via the API, which enqueues the ProjectExportWorker.
- Gitlab::ImportExport::Project::ExportTask: This is triggered by the rake task gitlab:import_export:export. The task doesn’t use the ProjectExportWorker and instead calls the service Projects::ImportExport::ExportService directly.
- When a project is created via a custom template: In this case, the worker ProjectTemplateExportWorker is used. The worker is identical to the ProjectExportWorker, but with a higher urgency.
Model - ProjectExportJob
The model ProjectExportJob is used to control the status of the project export process. The model was introduced to replace the lock files that were used in the past to determine the state of the export process.
A new project export job record is created in the ProjectExportWorker, and the worker JID is assigned to the record so it can be used by the StuckExportJobsWorker to mark stuck exports as failed.
The model defines the following possible statuses:
- queued
- started
- finished
- failed
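The four statuses can be read as a small state machine. The following is a minimal sketch in plain Ruby, not the real model: the class name `ExportJobStatus` and the allowed transitions are assumptions made for illustration, and the actual ProjectExportJob may permit other transitions.

```ruby
# Hypothetical stand-in for the status handling of ProjectExportJob.
class ExportJobStatus
  STATUSES = %i[queued started finished failed].freeze

  # Assumed transitions; the real model may allow others.
  TRANSITIONS = {
    queued: %i[started failed],
    started: %i[finished failed],
    finished: [],
    failed: []
  }.freeze

  attr_reader :status

  def initialize
    @status = :queued # a new record starts as queued
  end

  # Moves to new_status, or raises if the transition is not allowed.
  def transition_to(new_status)
    unless TRANSITIONS.fetch(@status).include?(new_status)
      raise ArgumentError, "cannot move from #{@status} to #{new_status}"
    end

    @status = new_status
  end
end

job = ExportJobStatus.new
job.transition_to(:started)
job.transition_to(:finished)
puts job.status
```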
Project export status - API
The following export statuses are documented as possible values:
- none: No exports _queued_, _started_, _finished_, or _being regenerated_.
- queued: The request for export is received, and is in the queue to be processed.
- started: The export process has started and is in progress. It includes:
  - The process of exporting.
  - Actions performed on the resulting file, such as sending an email notifying the user to download the file, or uploading the exported file to a web server.
- finished: After the export process has completed and the user has been notified.
- regeneration_in_progress: An export file is available to download, and a request to generate a new export is in process.
To determine the project export status, the status of the ProjectExportJob associated with the project is used.
Notes:
- The regeneration_in_progress status is documented as a possible state, but currently it never happens because when a new export is requested, the previous export file is deleted. So a situation where a project export job is in progress while the download file still exists never occurs. See !37427 (merged).
- It’s almost impossible for a project export to be in the queued state because the state is set to queued when the ProjectExportJob record is created inside the ProjectExportWorker, and straight after it’s updated to started. https://gitlab.com/gitlab-org/gitlab/blob/cd0296a7b9402b1bfcdb39c14a332144186a84a4/app/workers/project_export_worker.rb#L19-22
- The API doesn’t have a state for failed jobs. If a job fails, the reported state is none, which can be confusing to the user.
- Because several exports of the same project can be triggered simultaneously, the user will only see the finished state when all jobs are completed, and the export available for download will be the one from the job that completed last.
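To make the mapping from job statuses to the API status concrete, here is a rough sketch. The helper `export_status`, its parameters, and the order of the checks are assumptions for illustration; this is not the actual implementation.

```ruby
# Hypothetical helper that derives the API export status.
# job_statuses: statuses (symbols) of the project's export jobs
# file_exists:  whether an export file is available for download
def export_status(job_statuses, file_exists:)
  in_progress = job_statuses.any? { |s| %i[queued started].include?(s) }

  if in_progress && file_exists
    :regeneration_in_progress
  elsif job_statuses.include?(:queued)
    :queued
  elsif job_statuses.include?(:started)
    :started
  elsif file_exists
    :finished
  else
    :none # also what a failed job ends up reporting
  end
end

puts export_status([:failed], file_exists: false)            # prints "none"
puts export_status([:finished, :started], file_exists: false) # prints "started"
puts export_status([:finished, :finished], file_exists: true) # prints "finished"
```

Under these assumptions, a failed job with no export file is indistinguishable from a project that was never exported, which is the confusion described in the notes above.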
After-export strategy
After exporting a project, depending on the case, a follow-up process is executed. In the backend, these processes are called after-export strategies.
Currently, the following strategies are defined:
- DownloadNotificationStrategy: Sends a notification to let the user know that the export was completed and it’s available to be downloaded. This is the default strategy.
- WebUploadStrategy: Uploads the tarball to an external destination, for example, an external S3 bucket. This strategy can be set when using the API.
- MoveFileStrategy: Moves the tarball to a provided location on the local disk. This strategy is used for the rake task.
- CustomTemplateExportImportStrategy: Used to import the custom template into a project. See the "Custom Template" section.
Note: The strategies WebUploadStrategy and CustomTemplateExportImportStrategy expect parameters to be passed. For example, the WebUploadStrategy requires the destination URL where the tarball will be sent. Currently, the parameters aren’t stored in the database. They are passed as extra parameters to the ProjectExportWorker, which means they are kept temporarily in Redis/Sidekiq until the job is executed.
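The after-export step follows the strategy pattern, which can be sketched as below. The class names `DownloadNotification` and `WebUpload`, the `execute` method, and the `run_export` helper are simplified, hypothetical stand-ins; the real strategies perform I/O and have their own interfaces.

```ruby
# Hypothetical, simplified strategies; they only describe what they would do.
class DownloadNotification
  def execute(archive_path)
    "notify user: #{archive_path} is ready"
  end
end

class WebUpload
  # The destination URL is a parameter the strategy needs up front.
  def initialize(url:)
    @url = url
  end

  def execute(archive_path)
    "upload #{archive_path} to #{@url}"
  end
end

# Runs the follow-up step; the notification strategy is the default.
def run_export(archive_path, strategy: DownloadNotification.new)
  strategy.execute(archive_path)
end

puts run_export("project.tar.gz")
puts run_export("project.tar.gz", strategy: WebUpload.new(url: "https://example.com/upload"))
```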
Custom templates
Projects created from a custom template use the after-export strategy CustomTemplateExportImportStrategy.
When a project is created from a custom template, a bare minimum project is created first. Then the ProjectTemplateExportWorker is enqueued with the instruction to export the custom template, which generates an exported tarball of the custom template. Finally, the strategy enqueues a RepositoryImportWorker job to import the tarball into the bare minimum project that was created at the beginning.
It’s important to highlight that an export of the custom template is executed for every project created from it, so concurrent export processes can happen for the same project.
ImportExportProjectCleanupWorker
This worker runs every hour and deletes any project export upload that was last updated more than 24 hours ago. It also deletes files used to generate the tarball whose last modified date is older than 24 hours.
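The age check can be illustrated with a small sketch. The `Upload` struct and the `stale_uploads` helper are hypothetical names used only for this example, and they only select the stale records rather than delete anything.

```ruby
# Hypothetical record standing in for a project export upload.
Upload = Struct.new(:name, :updated_at)

DAY_IN_SECONDS = 24 * 60 * 60

# Returns the uploads whose last update is older than max_age seconds.
def stale_uploads(uploads, now: Time.now, max_age: DAY_IN_SECONDS)
  uploads.select { |upload| now - upload.updated_at > max_age }
end

now = Time.now
uploads = [
  Upload.new("old.tar.gz", now - (25 * 60 * 60)), # 25 hours old
  Upload.new("new.tar.gz", now - (60 * 60))       # 1 hour old
]

puts stale_uploads(uploads, now: now).map(&:name).inspect
```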
StuckExportJobsWorker
This worker runs every hour and marks any ProjectExportJob that has the status queued or started as failed if the job (JID) associated with it no longer exists in Sidekiq.
The JID won’t exist in Sidekiq if the job completed successfully, failed, or took more than 6 hours to complete.
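The rule above can be sketched as follows. The `ExportJob` struct, the `mark_stuck_as_failed` helper, and the `live_jids` list (the JIDs Sidekiq still knows about) are assumptions for illustration; the real worker queries the database and Sidekiq.

```ruby
# Hypothetical record standing in for a ProjectExportJob row.
ExportJob = Struct.new(:jid, :status)

# Marks queued or started jobs as failed when their JID is no longer live.
# live_jids: JIDs that Sidekiq still knows about (assumed to be given).
def mark_stuck_as_failed(jobs, live_jids)
  jobs.each do |job|
    next unless %i[queued started].include?(job.status)
    next if live_jids.include?(job.jid)

    job.status = :failed
  end
end

jobs = [
  ExportJob.new("jid-1", :started),  # still running in Sidekiq
  ExportJob.new("jid-2", :started),  # gone from Sidekiq, so it is stuck
  ExportJob.new("jid-3", :finished)  # already done, left untouched
]

mark_stuck_as_failed(jobs, ["jid-1"])
puts jobs.map(&:status).inspect
```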
What does this MR do and why?
This MR is a proof of concept of how we could parallelize the project export worker.
The prototype uses the BulkImports::RelationExportService to generate the relation exports in separate jobs and updates Projects::ImportExport::ExportService to download the relations created in parallel and build the tarball from them.
To track when all the relation exports are complete, the worker ImportExport::TrackExportRelationsWorker monitors the completion state of the relations. When all relations are completed, the worker triggers the ProjectExportWorker to continue the export process.
The following diagram gives an overview of the order of events:
sequenceDiagram
participant model as Project#35;export
participant relations_service as BulkImports#58;#58;RelationExportService
participant relations_worker as BulkImports#58;#58;RelationExportWorker
participant tracker as ImportExport#58;#58;TrackExportRelationsWorker
participant export_worker as ProjectExportWorker
participant export_service as Projects#58;#58;ImportExport#58;#58;ExportService
par
model->>relations_service: Calls relations generation service
relations_service->>relations_worker: Trigger relation A generation
relations_service->>relations_worker: Trigger relation B generation
relations_service->>relations_worker: Trigger relation ... generation
relations_service->>relations_worker: Trigger relation Z generation
model->>tracker: Enqueues the tracker
loop
tracker->>tracker: Monitor if all relations were generated. <br />Keep re-enqueuing itself until all relations are generated
end
end
tracker->>export_worker: Enqueues the worker
export_worker->>export_service: Download relation files and build tarball
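The tracker step in the diagram above can be sketched as a single decision made on each run. The `track` helper, its callable parameters, and the `:pending`/`:finished` statuses are hypothetical; this is an illustration of the idea, not the worker's actual code.

```ruby
# Hypothetical sketch of what the tracker decides on each run.
# relation_statuses: Hash of relation name => :pending or :finished
# enqueue_tracker:   callable that re-enqueues the tracker
# enqueue_export:    callable that enqueues the export worker
def track(relation_statuses, enqueue_tracker:, enqueue_export:)
  if relation_statuses.values.all? { |status| status == :finished }
    enqueue_export.call
    :export_enqueued
  else
    enqueue_tracker.call
    :reenqueued
  end
end

events = []
reenqueue = -> { events << :tracker }
export = -> { events << :export }

track({ issues: :finished, labels: :pending }, enqueue_tracker: reenqueue, enqueue_export: export)
track({ issues: :finished, labels: :finished }, enqueue_tracker: reenqueue, enqueue_export: export)
puts events.inspect
```

In this sketch the tracker never gives up, so a relation that fails and stays unfinished would keep it re-enqueuing forever.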
The project_export_jobs table was updated to allow a null JID: because the ProjectExportWorker isn't enqueued when the project_export_jobs record is created, no JID is available at that point. Because of this change, in the final solution, the StuckExportJobsWorker will have to be updated to deal with empty JIDs.
The AsyncProjectSaver (please ignore the quality of the code) is responsible for downloading the relations and moving the files to the correct location so that the final tarball is generated like before. The code got a bit confusing because each relation requires its file to be moved to a specific location.
The AsyncProjectSaver isn't handling the wiki and the snippets because the wiki relation isn't generated, and the snippets are generated as NDJSON while Import/Export needs them as repository bundles.
Because this MR is a proof of concept and not a complete solution, it has some flaws that need to be addressed in the final solution.
Problems:
- The StuckExportJobsWorker worker needs to be updated to handle null JIDs.
- The project_export_job status only changes from queued to started after all the relations are exported and the ProjectExportWorker starts.
- This solution doesn't support concurrent exports.
- The ImportExport::TrackExportRelationsWorker worker should fail the whole export process if one relation fails to export.