Draft: Prototype to parallelize project export worker (!89044) · Merge requests · GitLab.org / GitLab

Rodrigo Tomonari requested to merge rodrigo/parallelize-project-export-worker into master Jun 01, 2022

Context

For context, here is some information on how the export works today. It is in a separate section because it is a lot of information. 😄

Before discussing how to parallelize the ProjectExportWorker, here is some information on the feature as currently implemented.

ProjectExportWorker is a single-threaded process responsible for exporting a project into a tarball file. Underneath, it uses Projects::ImportExport::ExportService, which does most of the export work.

An example of the content of the file can be seen below:

|-tree
 | |-project.json
 | |-project
 | | |-protected_environments.ndjson
 | | |-boards.ndjson
 | | |-project_members.ndjson
 | | |-auto_devops.ndjson
 | | |-ci_pipelines.ndjson
 | | |-merge_requests.ndjson
 | | |-releases.ndjson
 | | |-protected_branches.ndjson
 | | |-prometheus_metrics.ndjson
 | | |-project_badges.ndjson
 | | |-custom_attributes.ndjson
 | | |-milestones.ndjson
 | | |-metrics_setting.ndjson
 | | |-push_rule.ndjson
 | | |-labels.ndjson
 | | |-snippets.ndjson
 | | |-external_pull_requests.ndjson
 | | |-issues.ndjson
 | | |-protected_tags.ndjson
 | | |-container_expiration_policy.ndjson
 | | |-pipeline_schedules.ndjson
 | | |-ci_cd_settings.ndjson
 | | |-security_setting.ndjson
 | | |-error_tracking_setting.ndjson
 | | |-service_desk_setting.ndjson
 | | |-project_feature.ndjson
 |-snippets
 | |-59802ab7b9610c444d6231da4c1bcdc0d54af594525476219b33f957f52939c2.bundle
 | |-5e37af85d49fb5656f55c62510c4a5d47b01fc6bfcddf4c6e3ffedc25640d291.bundle
 |-GITLAB_VERSION
 |-project.wiki.bundle
 |-uploads
 | |-e5ed7bc6b1fecb851ed7b120fcfe8a15
 | | |-image.jpeg
 |-lfs-objects.json
 |-lfs-objects
 | |-2cb698f3a725e42800172f662a10de64e26fbc4425ad871609472add43c77ffc
 | |-a804a6aec96c1f2a7db1c5e22c658405343272da1069ae26ce34a3cc39d83130
 | |-89dc431c9e6e503c46b6b7285f2f5542b03e46f4dd1e11cd73cf10c812f7321c
 | |-d262f804319ceb22ec80430141b46dba3f57c4b9f87afa65b3fea377ede7e76e
 |-VERSION
 |-project.bundle
 |-GITLAB_REVISION
 |-project.design.bundle

The export process can be triggered in different ways:

- ProjectsController#export: Triggered via the UI when the user clicks the button to export the project, which enqueues the ProjectExportWorker.
- API::ProjectExport#export: Triggered via the API, which enqueues the ProjectExportWorker.
- Gitlab::ImportExport::Project::ExportTask: Triggered by the rake task gitlab:import_export:export. The task doesn’t use the ProjectExportWorker and instead calls the service Projects::ImportExport::ExportService directly.
- When a project is created via a custom template: In this case, the worker ProjectTemplateExportWorker is used. The worker is identical to the ProjectExportWorker except that it has a higher urgency.

Model - ProjectExportJob

The model ProjectExportJob is used to track the status of the project export process. The model was introduced to replace the lock files that were used in the past to determine the state of the export process.

A new ProjectExportJob record is created in the ProjectExportWorker, and the worker JID is assigned to the record so that the StuckExportJobsWorker can use it to mark stuck exports as failed.

The model defines the following possible statuses:

- queued
- started
- finished
- failed
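
The lifecycle these statuses imply can be sketched in plain Ruby. This is only an illustration: the class name `ExportJobSketch` and the transition table are assumptions and are not taken from the actual ProjectExportJob model.

```ruby
# Hypothetical, simplified stand-in for the status handling described above.
# The allowed transitions are an assumption made for illustration.
class ExportJobSketch
  TRANSITIONS = {
    queued: %i[started failed],
    started: %i[finished failed],
    finished: [],
    failed: []
  }.freeze

  attr_reader :status, :jid

  def initialize(jid: nil)
    @jid = jid
    @status = :queued
  end

  # Moves to new_status if the transition is allowed; raises otherwise.
  def transition_to(new_status)
    unless TRANSITIONS.fetch(@status).include?(new_status)
      raise ArgumentError, "cannot go from #{@status} to #{new_status}"
    end

    @status = new_status
  end
end

job = ExportJobSketch.new(jid: "abc123")
job.transition_to(:started)
job.transition_to(:finished)
puts job.status
```

Keeping the transitions in one table makes it easy to see that finished and failed are terminal in this sketch.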

Project export status - API

The following export statuses are documented as possible values:

- none: No exports _queued_, _started_, _finished_, or _being regenerated_.
- queued: The request for export is received, and is in the queue to be processed.
- started: The export process has started and is in progress. It includes:
  - The process of exporting.
  - Actions performed on the resulting file, such as sending an email notifying
    the user to download the file, or uploading the exported file to a web server.
- finished: After the export process has completed and the user has been notified.
- regeneration_in_progress: An export file is available to download, and a request to generate a new export is in process.

To determine the project export status, the status of the ProjectExportJob associated with the project is used.

Notes:

- The regeneration_in_progress status is documented as a possible state, but currently it never happens because, when a new export is requested, the previous export file is deleted. So the situation where an in-progress project export job exists and the download file still exists never occurs. !37427 (merged)
- It’s almost impossible for a project export to be in the queued state because the state is set to queued when the ProjectExportJob record is created inside the ProjectExportWorker, and straight after it’s updated to started: https://gitlab.com/gitlab-org/gitlab/blob/cd0296a7b9402b1bfcdb39c14a332144186a84a4/app/workers/project_export_worker.rb#L19-22
- The API doesn’t have a state for failed jobs. If a job fails, the reported state is none, which can be confusing to the user.
- Because several exports of the same project can be triggered simultaneously, the user will only see the finished state when all jobs are completed, and the export available for download will be the one from the job that completed last.
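
One way to picture how the API status could be derived from the job statuses and the presence of the export file is the sketch below. The method name, its arguments, and the precedence of the checks are assumptions for illustration, not the code GitLab uses.

```ruby
# Hypothetical derivation of the API export status.
# job_statuses: statuses of the project's export jobs (symbols).
# export_file_exists: whether a tarball is currently available for download.
def export_status_sketch(job_statuses, export_file_exists:)
  in_progress = job_statuses.include?(:started)

  if in_progress && export_file_exists
    :regeneration_in_progress
  elsif job_statuses.include?(:queued)
    :queued
  elsif in_progress
    :started
  elsif export_file_exists
    :finished
  else
    # Failed jobs end up here too, since there is no dedicated failed state.
    :none
  end
end

puts export_status_sketch([:failed], export_file_exists: false)
puts export_status_sketch([:started, :finished], export_file_exists: false)
```

In this sketch a failed job with no export file reports none, which matches the confusing behavior mentioned in the notes.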

After-export strategy

After exporting a project, depending on the case, a follow-up process is executed. In the backend, these processes are called after-export strategies.

Currently, the following strategies are defined:

- DownloadNotificationStrategy: Sends a notification to let the user know that the export was completed and is available to be downloaded. This is the default strategy.
- WebUploadStrategy: Uploads the tarball to an external destination, for example, an external S3 bucket. This strategy can be set when using the API.
- MoveFileStrategy: Moves the tarball to a provided location on the local disk. This strategy is used for the rake task.
- CustomTemplateExportImportStrategy: Used to import the custom template into a project. See the "Custom templates" section.

Note: The strategies WebUploadStrategy and CustomTemplateExportImportStrategy expect parameters to be passed. For example, the WebUploadStrategy requires the destination URL where the tarball will be sent. Currently, the parameters aren’t stored in the database. They are passed as extra parameters to the ProjectExportWorker, which means they are kept temporarily in Redis/Sidekiq until the job is executed.
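
To illustrate what passing strategy parameters through the job arguments implies, here is a minimal sketch. Sidekiq job arguments are serialized as JSON, so the strategy has to be rebuilt from plain data when the job runs. The class names ending in `Sketch`, the `strategy` and `options` keys, the `http_method` option, and the `build_strategy` helper are all hypothetical.

```ruby
require "json"

# Hypothetical strategy objects; the names echo the list above, but the
# constructors and #describe are assumptions for illustration.
class DownloadNotificationSketch
  def initialize(**_options); end

  def describe
    "notify the user that the export is ready"
  end
end

class WebUploadSketch
  # url is required, mirroring the note that this strategy needs a destination.
  def initialize(url:, http_method: "PUT")
    @url = url
    @http_method = http_method
  end

  def describe
    "#{@http_method} tarball to #{@url}"
  end
end

STRATEGIES = {
  "download_notification" => DownloadNotificationSketch,
  "web_upload" => WebUploadSketch
}.freeze

# Rebuilds a strategy from job arguments that have round-tripped through JSON.
def build_strategy(json_args)
  args = JSON.parse(json_args, symbolize_names: true)
  klass = STRATEGIES.fetch(args.fetch(:strategy, "download_notification"))
  klass.new(**args.fetch(:options, {}))
end

payload = JSON.generate({ strategy: "web_upload", options: { url: "https://example.com/upload" } })
puts build_strategy(payload).describe
```

Because the parameters only live in the job payload, they are gone once the job has been executed or dropped, which is the limitation the note above points out.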

Custom templates

Projects created from a custom template use the after-export strategy CustomTemplateExportImportStrategy.

When a project is created from a custom template, a bare minimum project is created first. Then the ProjectTemplateExportWorker is enqueued with the instruction to export the custom template, which generates a tarball of the custom template. Finally, the strategy enqueues a RepositoryImportWorker job to import the tarball into the bare minimum project that was created at the beginning.

It’s important to highlight that for every project created from a custom template, a process to export the custom template is executed, so concurrent export processes can happen for the same project.

ImportExportProjectCleanupWorker

This worker runs every hour and deletes any project export upload that was last updated more than 24 hours ago. It also deletes files used to generate the tarball whose last modified date is older than 24 hours.
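
The selection rule can be sketched as follows. The 24-hour threshold comes from the description above, while the helper name `stale_paths` and its input shape (a hash of path to last-modified time) are assumptions.

```ruby
# Threshold taken from the description above: 24 hours, in seconds.
STALE_AFTER_SECONDS = 24 * 60 * 60

# Hypothetical helper: returns the paths whose last modification is older
# than the threshold relative to `now`.
def stale_paths(mtimes_by_path, now: Time.now)
  mtimes_by_path.select { |_path, mtime| now - mtime > STALE_AFTER_SECONDS }.keys
end

now = Time.now
files = {
  "tmp/project_exports/old" => now - (25 * 60 * 60),
  "tmp/project_exports/recent" => now - (60 * 60)
}
puts stale_paths(files, now: now)
```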

StuckExportJobsWorker

This worker runs every hour and marks any ProjectExportJob that has the status queued or started as failed if the job (JID) associated with it no longer exists in Sidekiq.

The JID won’t exist in Sidekiq if the job completed successfully, failed, or took more than 6 hours to complete.
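
The check described above boils down to comparing the JIDs stored on unfinished export jobs with the JIDs Sidekiq still knows about. A minimal sketch, assuming a hypothetical `JobRow` struct and a set of live JIDs obtained elsewhere:

```ruby
require "set"

# Hypothetical in-memory stand-in for a project_export_jobs row.
JobRow = Struct.new(:id, :status, :jid)

# Returns the jobs that should be marked as failed: still queued or started,
# but whose JID is not among the JIDs that are still alive.
def stuck_jobs(jobs, live_jids)
  jobs.select do |job|
    %i[queued started].include?(job.status) && !live_jids.include?(job.jid)
  end
end

jobs = [
  JobRow.new(1, :started, "jid-1"),
  JobRow.new(2, :started, "jid-2"),
  JobRow.new(3, :finished, "jid-3")
]
live = Set.new(["jid-1"])

stuck_jobs(jobs, live).each { |job| job.status = :failed }
jobs.each { |job| puts "#{job.id}: #{job.status}" }
```

Note that in this sketch a job with a nil JID would always be treated as stuck, since nil is never among the live JIDs.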

What does this MR do and why?

This MR is a proof of concept of how we could parallelize the ProjectExportWorker.

The prototype uses the BulkImports::RelationExportService to generate the relation exports in separate jobs and updates Projects::ImportExport::ExportService to download them and create the tarball using the relations created in parallel.

To track when all the relation exports are complete, the worker ImportExport::TrackExportRelationsWorker monitors the completion state of the relations. When all relations are completed, the worker triggers the ProjectExportWorker to continue the export process.

The following diagram gives an overview of the order of events:

sequenceDiagram
    participant model as Project#35;export
    participant relations_service as BulkImports#58;#58;RelationExportService
    participant relations_worker as BulkImports#58;#58;RelationExportWorker
    participant tracker as ImportExport#58;#58;TrackExportRelationsWorker
    participant export_worker as ProjectExportWorker
    participant export_service as Projects#58;#58;ImportExport#58;#58;ExportService

    par
        model->>relations_service: Calls relations generation service
        relations_service->>relations_worker: Trigger relation A generation
        relations_service->>relations_worker: Trigger relation B generation
        relations_service->>relations_worker: Trigger relation ... generation
        relations_service->>relations_worker: Trigger relation Z generation
        model->>tracker: Enqueues the tracker
        loop
            tracker->>tracker: Monitor if all relations were generated. <br />Keep re-enqueuing itself until all relations are generated
        end
    end    
    tracker->>export_worker: Enqueues the worker 
    export_worker->>export_service: Download relation files and build tarball
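
The tracker part of the diagram can be simulated with a small loop. This is a sketch only: `poll` is a hypothetical callable returning the current status of every relation export, and `max_attempts` and the `:gave_up` result are assumptions added so the simulation always terminates. In Sidekiq the worker would re-enqueue itself with a delay instead of looping.

```ruby
# Simulates the tracker re-enqueuing itself until all relations are finished.
def run_tracker(poll, max_attempts: 10)
  max_attempts.times do |attempt|
    statuses = poll.call
    if statuses.values.all?(:finished)
      # All relations exported: hand over to the export worker.
      return { attempts: attempt + 1, next: :project_export_worker }
    end
    # Otherwise the tracker would re-enqueue itself here.
  end

  { attempts: max_attempts, next: :gave_up }
end

snapshots = [
  { "issues" => :started, "labels" => :finished },
  { "issues" => :finished, "labels" => :finished }
]
poll = -> { snapshots.size > 1 ? snapshots.shift : snapshots.first }

p run_tracker(poll)
```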

The project_export_jobs table was updated to allow a null JID: the ProjectExportWorker isn't enqueued when the project_export_jobs record is created, so no JID is available at that point. Because of this change, in the final solution, the StuckExportJobsWorker will have to be updated to deal with empty JIDs.

The AsyncProjectSaver (ignore the quality of the code) is responsible for downloading the relations and moving the files to the correct location so that the final tarball is generated like before. The code got a bit confusing because each relation requires the file to be moved to a specific location.

The AsyncProjectSaver isn't handling the wiki and the snippets because the wiki relation isn't generated, and the snippets are generated as NDJSON whereas Import/Export needs them as repository bundles.

Because this MR is a proof of concept and not a complete solution, it has some flaws that need to be addressed in the final solution.
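
The "each relation goes to a specific location" part could be expressed as a lookup table. The sketch below is not the AsyncProjectSaver code. The destination paths follow the tarball layout shown in the Context section, but the relation names used as keys (`repository`, `design`, `uploads`, `lfs_objects`) and the helper `destination_for` are assumptions.

```ruby
# Relations that do not land under tree/project/<relation>.ndjson.
# Keys are hypothetical relation names; values follow the tarball layout.
SPECIAL_DESTINATIONS = {
  "repository" => "project.bundle",
  "design" => "project.design.bundle",
  "uploads" => "uploads",
  "lfs_objects" => "lfs-objects"
}.freeze

# Returns where the downloaded file(s) of a relation should be moved to
# inside the export directory.
def destination_for(relation, export_root)
  special = SPECIAL_DESTINATIONS[relation]
  return File.join(export_root, special) if special

  File.join(export_root, "tree", "project", "#{relation}.ndjson")
end

puts destination_for("issues", "/tmp/export")
puts destination_for("repository", "/tmp/export")
```

A table like this keeps the special cases in one place, which may help with the readability concern mentioned above.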

Problems:

- The StuckExportJobsWorker needs to be updated to handle null JIDs.
- The project_export_job status only changes from queued to started after all the relations are exported and the ProjectExportWorker starts.
- This solution doesn't support concurrent exports.
- The ImportExport::TrackExportRelationsWorker should fail the whole export process if one relation fails to export.
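
For the last problem, one possible shape of the tracker decision is sketched below. It is not part of the prototype. The method name, the status symbols, and the returned actions are assumptions.

```ruby
# Hypothetical tracker decision that fails the export as soon as one
# relation export has failed, instead of waiting forever.
def tracker_decision(relation_statuses)
  statuses = relation_statuses.values
  return :fail_export if statuses.include?(:failed)
  return :enqueue_project_export_worker if statuses.all?(:finished)

  :reenqueue_tracker
end

puts tracker_decision({ "issues" => :failed, "labels" => :started })
puts tracker_decision({ "issues" => :finished, "labels" => :finished })
```

Checking for failures first means a single failed relation stops the export even while other relations are still running.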

Edited Jun 12, 2022 by Rodrigo Tomonari
