GitLab housekeeping causes large numbers of loose objects in filesystem

Summary

We are facing a recurring issue with the way GitLab performs housekeeping. Consider a repository with many unreachable objects (e.g. many branches were deleted, or an old commit was changed and history was rewritten).

GitLab housekeeping runs in 3 stages: incremental repack, full repack and gc. Each stage is triggered after a configurable number of pushes (this is configured globally). The “full repack” stage unpacks all the unreachable objects into loose objects, i.e. individual files in the filesystem. If there are many of them, these files considerably slow down the repository until a gc pass cleans them up, which only happens once 2 weeks have elapsed since the unreachable objects were last packed (the default value of gc.pruneExpire).

Thus there are some cases where the housekeeping process actually makes the repository slower than before and puts a higher load on the filesystem.

In our case, large repos with many forks occasionally host many unreachable objects. A recent example repo on our instance had 300K unreachable objects and 400 forks. Depending on fork activity, the housekeeping process progressively repacked each of the 400 forks, and each repacked fork accumulated 300K loose objects in its repo on the filesystem. At some point performance dropped and we had to run gc manually in each fork to reduce the number of files on the storage (there were more than 10 million excess files, an order of magnitude more than all the other repos combined!).
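To find the affected forks, something like the following can list repositories by loose-object count. This is only a sketch: the function name is made up for illustration, and the storage root in the example comment is an assumption that depends on the installation.

```shell
#!/bin/sh
# Print "<loose object count><TAB><repo path>" for every bare repository
# (directory named *.git) under the given root, largest count first.
# list_loose_counts is a hypothetical helper, not part of GitLab.
list_loose_counts() {
    root=$1
    find "$root" -type d -name '*.git' -prune | while IFS= read -r repo; do
        # "count:" in count-objects -v is the number of loose objects.
        count=$(git --git-dir="$repo" count-objects -v | awk '/^count:/ {print $2}')
        printf '%s\t%s\n' "$count" "$repo"
    done | sort -rn
}

# Example call (the path is an assumption, adjust to your storage root):
# list_loose_counts /var/opt/gitlab/git-data/repositories
```

Repositories at the top of the list are the ones where a manual gc is likely to help most.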

We would like to find a way to prevent such situations from happening again. It seems this could easily be avoided by not letting the “full repack” phase unpack unreachable objects, since there is no reason to keep them.

Steps to reproduce

This should be easy to reproduce by creating unreachable objects in a repository and then running the GitLab housekeeping commands:

1. Create a new repo, clone it, and add a remote pointing to another project (the other project will provide some content to push and then make unreachable).
2. Create a branch in the new repo with content from the other project and push it.
3. Server side: run an incremental pack (GitLab runs git -c repack.writeBitmaps=false repack -d) to make sure the objects are packed.
4. Delete the branch to create unreachable objects in the new repo.
5. Server side: run a full repack (GitLab runs git -c repack.writeBitmaps=true repack -A -d --pack-kept-objects).
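The same sequence can also be tried locally with plain git, without a GitLab instance. The script below is a minimal sketch: the repository names, the branch names and the number of commits are arbitrary, and a local bare repository stands in for the server side.

```shell
#!/bin/sh
# Local reproduction with plain git; all names here are arbitrary.
set -eu

work=$(mktemp -d)
cd "$work"

# "Server side" bare repository whose HEAD points at main.
git init -q --bare server.git
git --git-dir=server.git symbolic-ref HEAD refs/heads/main

# Client repository: one commit on main, 20 more on topic.
git init -q client
cd client
git symbolic-ref HEAD refs/heads/main
git config user.name "Repro"
git config user.email "repro@example.com"
echo base > base.txt
git add base.txt
git commit -q -m "base"
git checkout -q -b topic
i=1
while [ "$i" -le 20 ]; do
    echo "content $i" > "file$i.txt"
    git add "file$i.txt"
    git commit -q -m "commit $i"
    i=$((i + 1))
done
git push -q ../server.git main topic

# Number of loose objects in the server repository.
loose_count() {
    git --git-dir="$work/server.git" count-objects -v | awk '/^count:/ {print $2}'
}

# Incremental pack, same options as quoted in the steps above.
git --git-dir="$work/server.git" -c repack.writeBitmaps=false repack -d -q

# Delete the branch so that its objects become unreachable.
git push -q ../server.git --delete topic
loose_before=$(loose_count)

# Full repack, same options as quoted in the steps above.
git --git-dir="$work/server.git" -c repack.writeBitmaps=true \
    repack -A -d --pack-kept-objects -q
loose_after=$(loose_count)

echo "loose objects before full repack: $loose_before"
echo "loose objects after full repack:  $loose_after"
```

If the behavior described in this issue is reproduced, the second number is greater than the first, because the objects that were only reachable from topic end up as loose files.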

What is the current behavior?

All the unreachable objects are unpacked as individual files in the objects directory, making performance worse than before housekeeping.

What is the expected correct behavior?

The “full repack” phase should optimize storage rather than create large numbers of loose objects.

Possible fixes

Here’s a bit of research we’ve done while troubleshooting the issue:

Currently GitLab housekeeping runs this command:

git -c repack.writeBitmaps=true repack -A -d --pack-kept-objects

Using -a instead of -A would discard all these unreachable objects, but I suppose that may cause problems if the repack operation runs concurrently with something writing new objects (a push, etc.). More interestingly, git gc internally runs a similar command, but with the option --unpack-unreachable=2.weeks.ago (or whatever is configured in gc.pruneExpire), so only the most recent unreachable objects are kept. It may make sense for GitLab to add a similar parameter to the command. Perhaps values like 1.days.ago or 2.days.ago would be appropriate.
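As a sketch of what this could look like, the current full repack command can be wrapped with an added --unpack-unreachable expiry. The function name and the default of 2.days.ago are assumptions made for illustration; this is not what GitLab runs today.

```shell
#!/bin/sh
# Full repack that only loosens unreachable objects newer than the expiry;
# older unreachable objects are dropped instead of being unpacked.
# full_repack_with_expiry is a hypothetical name; the default expiry is an
# assumption, not a GitLab setting.
full_repack_with_expiry() {
    repo=$1
    expiry=${2:-2.days.ago}
    git --git-dir="$repo" -c repack.writeBitmaps=true \
        repack -A -d --pack-kept-objects --unpack-unreachable="$expiry"
}

# Example call (path is a placeholder):
# full_repack_with_expiry /path/to/repo.git 1.days.ago
```

With this variant, unreachable objects packed within the expiry window would still become loose files, so a burst of recent deletions could still create many files for a day or two.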

Yet another option may be to stop unpacking unreachable objects altogether with the --keep-unreachable option of git-repack. As in the current housekeeping process, unreachable objects would not be deleted, but at least they would remain efficiently stored in a pack rather than unpacked into many files.
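A sketch of this variant follows. As far as I understand the git-repack documentation, --keep-unreachable is meant to be combined with -a -d and cannot be used together with -A, so the command below switches to -a. The function name is made up for illustration.

```shell
#!/bin/sh
# Full repack that keeps unreachable objects inside the new pack instead of
# unpacking them into loose files.
# full_repack_keep_unreachable is a hypothetical name, not a GitLab command.
full_repack_keep_unreachable() {
    repo=$1
    git --git-dir="$repo" -c repack.writeBitmaps=true \
        repack -a -d --keep-unreachable --pack-kept-objects
}

# Example call (path is a placeholder):
# full_repack_keep_unreachable /path/to/repo.git
```

The trade-off is that unreachable objects are then carried along from pack to pack until a gc run prunes them.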

Alternatively, we could work around the issue by configuring housekeeping to skip the full repack phase altogether, letting GitLab housekeeping do only (frequent) incremental repacks and (less frequent) gc. If the same period were configured for full repacks and gc, the housekeeping service (https://gitlab.com/gitlab-org/gitlab-ce/blob/master/app/services/projects/housekeeping_service.rb) would just perform gc, as we’d like. However, parameter validation of the housekeeping settings does not currently let us set the same period for full repack and gc (the gc period must be greater than the full repack period).
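To illustrate why equal periods would have this effect, here is a small sketch of the task selection as I read it in housekeeping_service.rb: the push counter is tested against the gc period first, then against the full repack period. This is my understanding rather than a verified copy of the service, and the function name and period values are only examples.

```shell
#!/bin/sh
# Sketch of the task choice based on the number of pushes since the last gc.
# Assumption: gc is checked first, then full repack, otherwise incremental.
# pick_task is a hypothetical name used for illustration only.
pick_task() {
    pushes=$1
    full_repack_period=$2
    gc_period=$3
    if [ $((pushes % gc_period)) -eq 0 ]; then
        echo gc
    elif [ $((pushes % full_repack_period)) -eq 0 ]; then
        echo full_repack
    else
        echo incremental_repack
    fi
}

pick_task 50 50 200    # prints "full_repack"
pick_task 200 50 200   # prints "gc"
pick_task 50 50 50     # prints "gc": with equal periods gc always wins
```

Under that assumption, setting both periods to the same value means the full repack branch is never reached, which is exactly the workaround the validation currently prevents.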