GitLab housekeeping causes large numbers of loose objects in filesystem
Summary
We are facing a recurring issue with the way GitLab performs housekeeping. Consider a repository with many unreachable objects (e.g. many branches were deleted, an old commit was changed and history was rewritten…)
GitLab housekeeping runs in 3 stages: incremental pack, full repack and gc, each taking place after a configurable number of pushes (this is configured globally). The “full repack” stage will unpack all the unreachable objects into loose objects as individual files in the filesystem. These files will stay there and (if there are many of them) considerably slow down the repository, until a gc pass cleans them up (provided 2 weeks have elapsed since the unreachable objects were last packed, as per the default value for gc.pruneExpire).
Thus there are some cases where the housekeeping process actually makes the repository slower than before and puts a higher load on the filesystem.
In our case, large repos with many forks occasionally host many unreachable objects: a recent practical example repo on our instance had 300K unreachable objects and 400 forks. Depending on the fork activity, each of the 400 forks got repacked progressively over time by the housekeeping process, and started accumulating 300K loose objects each in their repo filesystem. At some point performance was dropping and we had to run gc manually in each fork to reduce the number of files on the storage (there were more than 10 million excess files, an order of magnitude more than all the other repos combined!).
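To spot affected repositories before performance degrades, something like the following sketch could be used. It relies on git count-objects -v, whose "count" field is the number of loose objects. The helper names (loose_count, report_loose), the threshold, and the assumption that repositories are stored as directories ending in .git under a common root are all illustrative and not taken from GitLab itself.

```shell
#!/bin/sh
# Sketch: list repositories whose loose-object count exceeds a threshold.
set -eu

# loose_count REPO: print the number of loose objects in REPO,
# taken from the "count:" line of `git count-objects -v`.
loose_count() {
  git -C "$1" count-objects -v | awk '$1 == "count:" {print $2}'
}

# report_loose ROOT THRESHOLD: print "<count><TAB><path>" for every
# directory named *.git under ROOT holding more than THRESHOLD loose objects.
# Assumption: every *.git directory under ROOT is a Git repository.
report_loose() {
  find "$1" -type d -name '*.git' -prune | while IFS= read -r repo; do
    n=$(loose_count "$repo")
    if [ "$n" -gt "$2" ]; then
      printf '%s\t%s\n' "$n" "$repo"
    fi
  done
}
```

For example, report_loose /path/to/repositories 10000 would list only the repositories with more than 10000 loose objects (the path is a placeholder).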
We would like to find a way to prevent such situations from happening again. It seems this could be easily avoided by not letting the “full repack” phase unpack unreachable objects, since there is no reason to keep them.
Steps to reproduce
This should be easy to repro by creating unreachable objects in a repository and then running the gitlab housekeeping commands:
- Create a new repo, clone it, and add a remote pointing to another project (the other project provides content to push and then make unreachable)
- Push a branch with content from the other project to the new repo
- Server side: run an incremental pack (gitlab runs git -c repack.writeBitmaps=false repack -d) to make sure objects are packed
- Delete the branch to create unreachable objects in the new repo
- Server side: run a full repack (gitlab runs git -c repack.writeBitmaps=true repack -A -d --pack-kept-objects)
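The steps above can be sketched locally without a GitLab instance, using a bare repository to stand in for the server side. This is only an approximation under a few assumptions: local commits replace the content fetched from the other project, the branch is deleted with git update-ref instead of through GitLab, an extra main branch is kept so the full repack still has reachable objects to pack, and all paths and names are made up. The -q flag is added only to silence progress output.

```shell
#!/bin/sh
# Sketch of the reproduction steps against a throwaway bare repository.
set -eu

scratch=$(mktemp -d)            # hypothetical scratch location
server="$scratch/server.git"    # stands in for the repository on the server
client="$scratch/client"        # stands in for the developer's clone

git init -q --bare "$server"
git init -q "$client"
cd "$client"
git config user.email "dev@example.com"
git config user.name "Example Dev"

# Assumption: one branch stays reachable throughout.
git symbolic-ref HEAD refs/heads/main
echo base > base.txt
git add base.txt
git commit -q -m "base"

# Local commits stand in for the content taken from the other project.
git checkout -q -b topic
for i in 1 2 3 4 5; do
  echo "content $i" > "file$i.txt"
  git add "file$i.txt"
  git commit -q -m "topic commit $i"
done
git push -q "$server" main topic

cd "$server"
# Incremental pack, as run by housekeeping: pushed objects end up in a pack.
git -c repack.writeBitmaps=false repack -q -d
loose_before=$(git count-objects -v | awk '$1 == "count:" {print $2}')

# Deleting the branch makes its commits, trees and blobs unreachable.
git update-ref -d refs/heads/topic

# Full repack, as run by housekeeping.
git -c repack.writeBitmaps=true repack -q -A -d --pack-kept-objects
loose_after=$(git count-objects -v | awk '$1 == "count:" {print $2}')

echo "loose objects before full repack: $loose_before"
echo "loose objects after full repack:  $loose_after"
```

If the behavior described in this issue holds, the second number should be larger than the first, because the objects that were only reachable from the deleted branch are written back out as loose files.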
What is the current behavior?
All the unreachable objects are unpacked as individual files in the repository's objects directory, making performance worse than before housekeeping
What is the expected correct behavior?
The “full repack” phase optimizes storage
Possible fixes
Here’s a bit of research we’ve done while troubleshooting the issue:
Currently gitlab housekeeping runs this command:
git -c repack.writeBitmaps=true repack -A -d --pack-kept-objects
Using -a instead of -A would discard all these unreachable objects, but I suppose that may cause problems if the repack operation takes place concurrently with something writing new objects (push etc.). More interestingly, git gc internally runs a similar command, but with option --unpack-unreachable=2.weeks.ago (or whatever is configured in gc.pruneExpire) so only the most recent unreachable objects are kept. It may make sense that GitLab adds a similar parameter to the command. Perhaps values like 1.days.ago or 2.days.ago would be appropriate.
Yet another option may be to disable unpacking unreachable objects altogether with the --keep-unreachable option to git-repack. Unreachable objects will not be deleted (like happens in the current housekeeping process) but at least they will remain more efficiently stored in a pack rather than unpacked as many files.
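The options discussed here can be compared side by side with a sketch like the one below. It builds three identical throwaway bare repositories and runs one repack variant on each. Several things are assumptions of the sketch and not of GitLab: the helper names (make_server, stat_of), all paths, the use of touch to backdate the pack so that its objects look older than one day, and the 1.days.ago cutoff. The --keep-unreachable variant uses -a because the git-repack documentation describes that option as used together with -ad, and bitmaps are switched off for it only to keep the sketch focused on object storage.

```shell
#!/bin/sh
# Sketch: compare how three full-repack variants treat unreachable objects.
set -eu

scratch=$(mktemp -d)   # hypothetical scratch location

# make_server PATH: create a bare repo at PATH with a reachable main branch
# and packed objects made unreachable by deleting a topic branch.
make_server() {
  git init -q --bare "$1"
  git init -q "$1.client"
  (
    cd "$1.client"
    git config user.email "dev@example.com"
    git config user.name "Example Dev"
    git symbolic-ref HEAD refs/heads/main
    echo base > base.txt
    git add base.txt
    git commit -q -m "base"
    git checkout -q -b topic
    for i in 1 2 3 4 5; do
      echo "content $i" > "file$i.txt"
      git add "file$i.txt"
      git commit -q -m "topic commit $i"
    done
    git push -q "$1" main topic
  )
  git -C "$1" -c repack.writeBitmaps=false repack -q -d
  git -C "$1" update-ref -d refs/heads/topic
  # Backdate the pack so its objects count as old (device of this sketch).
  touch -t 200001010000 "$1"/objects/pack/*.pack
}

# stat_of REPO KEY: print one field of `git count-objects -v`.
stat_of() {
  git -C "$1" count-objects -v | awk -v key="$2:" '$1 == key {print $2}'
}

# Variant A: the command housekeeping runs today.
make_server "$scratch/a.git"
git -C "$scratch/a.git" -c repack.writeBitmaps=true \
  repack -q -A -d --pack-kept-objects
a_loose=$(stat_of "$scratch/a.git" count)

# Variant B: same command with an expiry for unreachable objects.
make_server "$scratch/b.git"
git -C "$scratch/b.git" -c repack.writeBitmaps=true \
  repack -q -A -d --pack-kept-objects --unpack-unreachable=1.days.ago
b_loose=$(stat_of "$scratch/b.git" count)
b_pack=$(stat_of "$scratch/b.git" in-pack)

# Variant C: keep unreachable objects, but inside the new pack.
make_server "$scratch/c.git"
git -C "$scratch/c.git" -c repack.writeBitmaps=false \
  repack -q -a -d --keep-unreachable
c_loose=$(stat_of "$scratch/c.git" count)
c_pack=$(stat_of "$scratch/c.git" in-pack)

echo "A (-A -d):               loose=$a_loose"
echo "B (--unpack-unreachable): loose=$b_loose in-pack=$b_pack"
echo "C (--keep-unreachable):   loose=$c_loose in-pack=$c_pack"
```

Under these assumptions, variant A should leave the unreachable objects as loose files, variant B should drop them because they are older than the cutoff, and variant C should leave no loose objects while keeping the unreachable ones in the pack.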
Alternatively, we could work around the issue by configuring housekeeping to skip the full repack phase altogether, and only let gitlab housekeeping do (frequent) incremental repacks and (less frequent) gc. If the same period were configured for full repacks and gc, the housekeeping service (https://gitlab.com/gitlab-org/gitlab-ce/blob/master/app/services/projects/housekeeping_service.rb) would just perform gc, as we’d like. But parameter validation of the housekeeping settings does not currently let us set the same period for full repack and gc (the gc period must be > the full repack period).