Add tool to scan and import filesystem metadata into the database
Context
This MR resolves #56 (closed). It adds a tool to populate the registry metadata database by scanning the repository filesystem and importing the relevant metadata.
Solution
This MR introduces a new database subcommand for the registry binary, named import:
$ registry database import --help
Import filesystem metadata into the database. This should only be used for an initial one-off
migration, starting with an empty database. Dangling blobs are not imported, only referenced
ones. This tool is not concurrency safe.
Usage:
registry database import [flags]
Flags:
-d, --dry-run=false: do not commit changes to the database
-h, --help=false: help for import
This command provides a dry run mode so that it can be tested without committing changes to the database.
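Because the whole import runs inside a single transaction (see Limitations below), dry-run mode can be honoured by rolling that transaction back instead of committing it. A minimal sketch of that decision, using plain function values in place of (*sql.Tx).Commit/Rollback so it needs no database (the function name and signature are illustrative, not the actual implementation):

```go
package main

// finishImport sketches how --dry-run can be honoured when the whole
// import runs in one transaction: commit persists the imported metadata,
// rollback discards it. Function values stand in for a real transaction's
// Commit/Rollback; the real importer's wiring may differ.
func finishImport(dryRun bool, commit, rollback func() error) error {
	if dryRun {
		// Dry-run mode: the full import was exercised, now discard changes.
		return rollback()
	}
	return commit()
}
```

This keeps the dry run faithful: every insert is executed exactly as in a real run, and only the final commit is skipped.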
Limitations and Possible Improvements
- Currently we wrap the whole import in a single transaction. This is not ideal: it prevents retrying imports, and the database transaction log will grow considerably for very large repositories, placing pressure on the system due to high I/O. This will be addressed in a separate issue (#86 (closed)).
- The import isn't concurrency safe, which means we can't walk the filesystem in parallel when scanning repositories. Mixing a concurrent storage walk with concurrent database transactions is notoriously difficult, especially as we intend to record and maintain the repositories' hierarchical relationship. This will be investigated and addressed in a separate issue (#85 (closed)). The non-concurrent import is enough for our immediate needs.
- Metadata about dangling blobs is not imported, only that of referenced blobs. In the future we should provide a flag to import dangling blobs as well (#87 (closed)).
Testing
The registry.datastore.Importer tool is tested using table-driven tests, fixtures, and golden files.
Fixtures
A sample repository was created in registry/datastore/testdata/fixtures/importer. It contains multiple repositories, created to simulate the relevant scenarios, including:
- docker/registry/v2/repositories/a-simple: a simple repository with only one manifest, one layer, and two tags pointing to the same manifest.
- docker/registry/v2/repositories/b-nested: a set of nested repositories that share some layers between them.
- docker/registry/v2/repositories/c-manifest-list: a repository with platform-specific manifests and an aggregating manifest list.
- docker/registry/v2/repositories/d-schema1: a repository with a deprecated schema 1 manifest.
- docker/registry/v2/repositories/e-helm: a repository with a Helm chart that also serves as a test for OCI manifests.
When running the tests, a filesystem storage driver is configured to use the testdata/fixtures/importer repository as its source.
For compactness, all layer blobs were truncated to a few kilobytes each; the whole repository is less than 500K in size.
Golden Files
Running the metadata import against the test repository generates a considerable amount of data in the database. It's neither practical nor maintainable to define structs for all expected rows and then compare them one by one.
The only thing we need to assert here is that the import process works as expected, meaning that once complete, the database tables look exactly like a set of manually pre-validated snapshots/dumps (in JSON format for easy readability), i.e. golden files.
Therefore, instead of comparing item by item, we simply save the expected database table contents to .golden files within registry/datastore/testdata/golden/TestImporter_Import, where files are named <table>.golden.
Test Flags
To facilitate the development process, two new flags for the go test command were added:
- update: Updates existing golden files with a new expected value. For example, if we change a column name, it's impractical to update the golden files manually. With this flag the golden files are automatically updated with a fresh dump that reflects the new column name.
- create: Creates missing golden files, then runs update. If we add new tables, new golden files need to be created for them. Instead of creating them manually, we can use this flag and they will be automatically created and populated with the current table content.
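A sketch of how such flags can be defined, using a standalone FlagSet so the parsing is testable in isolation (the flag names come from the MR; in the real tests they are registered on go test's flag set, so the wiring below is an assumption):

```go
package main

import "flag"

// parseGoldenFlags parses the two golden-file flags described above.
// A dedicated FlagSet keeps this sketch self-contained; the actual tests
// register the flags globally so `go test -update -create` picks them up.
func parseGoldenFlags(args []string) (create, update bool, err error) {
	fs := flag.NewFlagSet("golden", flag.ContinueOnError)
	fs.BoolVar(&create, "create", false, "create missing golden files, then update them")
	fs.BoolVar(&update, "update", false, "update existing golden files with a fresh dump")
	err = fs.Parse(args)
	return create, update, err
}
```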
Test Helpers
New helper functions were added to the registry.datastore.testutil package. These provide the logic required to handle the creation (createGoldenFile), update (updateGoldenFile) and reading (readGoldenFile) of golden files (all internal functions), as well as the comparison of a given value with the contents of a golden file (a public function, named CompareWithGoldenFile):
// CompareWithGoldenFile compares an actual value with the content of a .golden file. If requested, a missing golden
// file is automatically created and an outdated golden file automatically updated to match the actual content.
CompareWithGoldenFile(tb testing.TB, path string, actual []byte, create, update bool)
The caller (the specific test) is responsible for passing the values of the update and create go test flags to this function.
Table Driven Tests
Table-driven tests are used to tie everything together. Instead of multiple test functions, we use a single TestImporter_Import test with a sub-test for each table.
The test starts by building a test registry and storage driver, and then runs the Import method of the registry.datastore.Importer struct. This populates the database with the metadata of the fixture repository and should complete without errors.
Once done, we use table tests to validate each table, performing the following steps for each:
- Create a new sub-test, using the table name as the test name;
- Dump the corresponding table content as a JSON payload, using the PostgreSQL json_agg aggregate function;
- Compare the obtained dump with the corresponding golden file. These must match.
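The table list and the json_agg dump statement can be sketched as follows. The table names are taken from the test output below; the ORDER BY clause is an assumption added so dumps stay stable between runs, and the real test may build the query differently:

```go
package main

import "fmt"

// tables mirrors the per-table sub-tests of TestImporter_Import
// (names taken from the test output shown in this MR).
var tables = []string{
	"repositories", "manifest_configurations", "manifests",
	"repository_manifests", "layers", "manifest_layers",
	"manifest_lists", "manifest_list_items",
	"repository_manifest_lists", "tags",
}

// dumpQuery builds the PostgreSQL statement that dumps a whole table as
// a single JSON array via json_agg. The deterministic ORDER BY is an
// assumption so dumps compare stably against golden files.
func dumpQuery(table string) string {
	return fmt.Sprintf("SELECT json_agg(t) FROM (SELECT * FROM %s ORDER BY id) t", table)
}
```

Inside the test, each sub-test would run dumpQuery for its table and hand the resulting JSON to the golden-file comparison helper.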
Output
Success
A success output looks as follows:
--- PASS: TestImporter_Import (0.16s)
=== RUN TestImporter_Import/repositories
--- PASS: TestImporter_Import/repositories (0.00s)
=== RUN TestImporter_Import/manifest_configurations
--- PASS: TestImporter_Import/manifest_configurations (0.00s)
=== RUN TestImporter_Import/manifests
--- PASS: TestImporter_Import/manifests (0.00s)
=== RUN TestImporter_Import/repository_manifests
--- PASS: TestImporter_Import/repository_manifests (0.00s)
=== RUN TestImporter_Import/layers
--- PASS: TestImporter_Import/layers (0.00s)
=== RUN TestImporter_Import/manifest_layers
--- PASS: TestImporter_Import/manifest_layers (0.00s)
=== RUN TestImporter_Import/manifest_lists
--- PASS: TestImporter_Import/manifest_lists (0.00s)
=== RUN TestImporter_Import/manifest_list_items
--- PASS: TestImporter_Import/manifest_list_items (0.00s)
=== RUN TestImporter_Import/repository_manifest_lists
--- PASS: TestImporter_Import/repository_manifest_lists (0.00s)
=== RUN TestImporter_Import/tags
--- PASS: TestImporter_Import/tags (0.00s)
PASS
Golden File Mismatch
If the obtained dump and the golden file content don't match, a nice diff is presented:
Diff:
--- Expected
+++ Actual
@@ -1,3 +1,3 @@
[{"id":1,"name":"a-simple","path":"a-simple","parent_id":null,"created_at":"2020-04-15T12:04:28.95584","deleted_at":null},
- {"id":2,"name":"b-nestedddddddd","path":"b-nested","parent_id":null,"created_at":"2020-04-15T12:04:28.95584","deleted_at":null},
+ {"id":2,"name":"b-nested","path":"b-nested","parent_id":null,"created_at":"2020-04-15T12:04:28.95584","deleted_at":null},
{"id":3,"name":"older","path":"b-nested/older","parent_id":2,"created_at":"2020-04-15T12:04:28.95584","deleted_at":null},
Test: TestImporter_Import/repositories
Messages: does not match .golden file
--- FAIL: TestImporter_Import/repositories (0.00s)
Caveat
Some values in the JSON dumps vary between runs. For example, the created_at timestamp of each row differs between test runs (as rows are created on every run). To have reproducible comparisons against the golden files, we must override these varying attributes with a fixed value (one that matches the golden files).
Fortunately, this is only necessary for the created_at attribute of all entities and for two fields of signed schema 1 manifest payloads (the signature varies with every run). To make this possible, a utility overrideDynamicData function was created, inside which all varying fields are overridden using regular expressions.
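The regular-expression override can be sketched as below. The placeholder timestamp and the expression are assumptions for illustration, and the real helper also rewrites the schema 1 signature fields:

```go
package main

import "regexp"

// createdAtRe matches any created_at value in a JSON dump.
var createdAtRe = regexp.MustCompile(`"created_at":"[^"]*"`)

// overrideDynamicData sketches the helper described above: it rewrites
// run-dependent values in a JSON table dump to fixed placeholders so the
// dump can be compared against a golden file. The fixed timestamp is an
// arbitrary illustrative value.
func overrideDynamicData(dump []byte) []byte {
	return createdAtRe.ReplaceAll(dump, []byte(`"created_at":"2020-04-15T12:04:28.95584"`))
}
```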
Overall, the benefits of this approach largely outweigh this caveat.
Next Steps
This tool will be used to extrapolate database query rate and size requirements.
We will also look at implementing the improvements described above.