Problems arise if there are multiple parallel Omnibus backup/restore processes
Summary
Unpredictable behaviour can occur if more than one backup process is running at the same time, and the backup code does not appear to protect against this, for example with a lock or PID file mechanism that detects the duplicate process.
Support has been investigating the following error (GitLab team members can read more in the ticket) generated by a 14.9.x self-managed instance:
2022-04-13 04:34:48 +0000 -- Dumping database ...
Dumping PostgreSQL database gitlabhq_production ... [DONE]
2022-04-13 04:36:28 +0000 -- done
[snip]
Creating backup archive: 1649824588_2022_04_13_14.0.12-ee_gitlab_backup.tar ... tar: db: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors
creating archive 1649824588_2022_04_13_14.0.12-ee_gitlab_backup.tar failed
rake aborted!
Backup::Error: Backup failed
/opt/gitlab/embedded/service/gitlab-rails/lib/backup/manager.rb:36:in `block in pack'
/opt/gitlab/embedded/service/gitlab-rails/lib/backup/manager.rb:27:in `chdir'
/opt/gitlab/embedded/service/gitlab-rails/lib/backup/manager.rb:27:in `pack'
/opt/gitlab/embedded/service/gitlab-rails/lib/tasks/gitlab/backup.rake:27:in `block (3 levels) in <top (required)>'
/opt/gitlab/embedded/bin/bundle:23:in `load'
/opt/gitlab/embedded/bin/bundle:23:in `<main>'
Tasks: TOP => gitlab:backup:create
(See full trace by running task with --trace)
The root cause looks likely to be:
- Scheduled backup took longer than usual to run.
- A GitLab upgrade was initiated, which kicks off a backup of the database.
- The scheduled backup completed, likely creating a tar file (with, now, inconsistent contents) and deleting all the working files and directories.
- The database backup completes, but the output files and directories have been removed.
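The suspected interleaving can be sketched as a small deterministic simulation. This is only an illustration of the hypothesis above, not GitLab's backup code: `simulate_interleaving` is a hypothetical helper, the `db` directory name is taken from the `tar` error, and the cleanup step stands in for whatever the scheduled backup removes when it finishes.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical simulation of two backups sharing one working directory.
# Returns true if the db directory still exists when the second backup
# tries to archive it.
def simulate_interleaving(shared_dir)
  db_dir = File.join(shared_dir, "db")

  # 1. The scheduled backup is running and has created its working files.
  FileUtils.mkdir_p(db_dir)

  # 2. The upgrade's database backup starts and dumps into the same directory.
  File.write(File.join(db_dir, "database.sql"), "-- dump from upgrade backup")

  # 3. The scheduled backup finishes and deletes the shared working files.
  FileUtils.rm_rf(db_dir)

  # 4. The upgrade's backup now tries to archive db, which is gone.
  File.exist?(db_dir)
end

Dir.mktmpdir do |shared|
  puts "db present at archive time: #{simulate_interleaving(shared)}"
end
```

With a shared directory, step 4 finds nothing to archive, which matches the `tar: db: Cannot stat: No such file or directory` failure in the log.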
Note also the issue we have for the contents of restores being left behind on disk: #334401 (closed)
This suggests that backups and restores operate in the same location using the same files and directories, so there's scope for restores to be corrupted by scheduled backups as well.
Steps to reproduce
Run multiple backups on the same machine at the same time.
Example Project
n/a
What is the current bug behavior?
Multiple backup processes will co-exist in /var/opt/gitlab/backups but in practice are using the same output files.
What is the expected correct behavior?
As written, it's not safe to run multiple backups and restores in parallel, so the second process should detect an in-flight process and either terminate or wait for the existing process to complete.
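One way to detect an in-flight process is an advisory file lock. The following is a minimal sketch, assuming a lock file next to the backups; `with_backup_lock` and the `:already_running` return value are hypothetical names, not part of GitLab's backup code.

```ruby
# Hypothetical helper: runs the block only if no other process holds the
# lock on lock_path. Returns the block's value, or :already_running if
# another process (or another open handle) already holds the lock.
def with_backup_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |f|
    # LOCK_NB makes flock return false immediately instead of blocking.
    return :already_running unless f.flock(File::LOCK_EX | File::LOCK_NB)

    # Record our PID for anyone inspecting the lock file.
    f.truncate(0)
    f.write(Process.pid.to_s)
    f.flush

    yield
    # The lock is released when the file is closed at the end of the block.
  end
end
```

Dropping `File::LOCK_NB` would make the second process wait for the first to finish instead of terminating. Because the kernel releases an `flock` when the holding process exits, a crashed backup should not leave a lock behind with this approach.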
Inevitably, we'll get cases where a backup failed or was terminated and left its lock behind. Subsequent backups would then refuse to run, and the consequence is likely to be customers who don't know they have no backups. So, ideally, the second backup should take reasonable steps to check whether an existing backup is actually running, with a bias towards trying to make a backup.
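If a PID file were used, a stale-lock check could look like the sketch below. `process_alive?` and `backup_in_progress?` are hypothetical helpers, and the PID file location is an assumption. Anything that cannot be confirmed as a live process is treated as stale, which biases towards taking a backup.

```ruby
# Hypothetical helper: true if a process with this PID currently exists.
def process_alive?(pid)
  Process.kill(0, pid) # signal 0 only checks for existence; nothing is sent
  true
rescue Errno::ESRCH
  false # no such process
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

# Hypothetical helper: true only if pid_path names a process that is still
# running. A missing, empty, or garbled PID file counts as stale.
def backup_in_progress?(pid_path)
  pid = Integer(File.read(pid_path).strip, 10)
  pid > 0 && process_alive?(pid)
rescue Errno::ENOENT, ArgumentError
  false
end
```

This check cannot tell whether a live PID really belongs to a backup, since PIDs can be reused by unrelated processes, so it would need to be combined with something stronger (for example, a kernel-level file lock) to be reliable.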
Alternatively, segregate the temporary working directories. Perhaps make it optional, so a GitLab upgrade can "request" its own separate part of the filesystem, as a database backup won't use much space.
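Segregated working directories could be sketched as follows, assuming each run gets its own uniquely named directory under the backup root. `with_private_workdir` and the `backup-work-` prefix are hypothetical, not existing GitLab behaviour.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: gives each backup run a private working directory
# under backup_root and removes it when the block finishes.
def with_private_workdir(backup_root)
  FileUtils.mkdir_p(backup_root)
  # With a block, Dir.mktmpdir creates a uniquely named directory and
  # deletes it (and its contents) once the block returns.
  Dir.mktmpdir("backup-work-", backup_root) do |workdir|
    yield workdir
  end
end
```

With this layout, one run's cleanup only removes its own directory, so a concurrent run's `db` dump would not be affected.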
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:check SANITIZE=true`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`) (we will only investigate if the tests are passing)