Add tool to scan and import filesystem metadata into the database
Context
This MR resolves #56 (closed). It adds a tool to populate the registry metadata database by scanning the repository filesystem and importing the relevant metadata.
Solution
This MR introduces a new database subcommand for the registry binary, named import:
$ registry database import --help
Import filesystem metadata into the database. This should only be used for an initial one-off
migration, starting with an empty database. Dangling blobs are not imported, only referenced
ones. This tool is not concurrency safe.
Usage:
registry database import [flags]
Flags:
-d, --dry-run=false: do not commit changes to the database
-h, --help=false: help for import
This command provides a dry run mode so that it can be tested without committing changes to the database.
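Because the whole import runs inside a single transaction (see Limitations below), dry-run mode can be honoured by rolling that transaction back instead of committing it. A minimal sketch of that decision, using plain function values in place of (*sql.Tx).Commit/Rollback so it needs no database (the function name and signature are illustrative, not the actual implementation):

```go
package main

// finishImport sketches how --dry-run can be honoured when the whole
// import runs in one transaction: commit persists the imported metadata,
// rollback discards it. Function values stand in for a real transaction's
// Commit/Rollback; the real importer's wiring may differ.
func finishImport(dryRun bool, commit, rollback func() error) error {
	if dryRun {
		// Dry-run mode: the full import was exercised, now discard changes.
		return rollback()
	}
	return commit()
}
```

This keeps the dry run faithful: every insert is executed exactly as in a real run, and only the final commit is skipped.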
Limitations and Possible Improvements
- Currently we wrap the whole import in a single transaction. This is not ideal: it prevents retrying imports, and the database transaction log will grow considerably for very large repositories, placing pressure on the system due to high I/O. This will be addressed in a separate issue (#86 (closed)).
- The import isn't concurrency safe, which means we can't walk the filesystem in parallel when scanning repositories. Mixing a concurrent storage walk with concurrent database transactions is notoriously difficult, especially as we intend to record and maintain the repositories' hierarchical relationship. This will be investigated and addressed in a separate issue (#85 (closed)). The non-concurrent import is enough for our immediate needs.
- Metadata about dangling blobs is not imported, only that of referenced blobs. In the future we should provide a flag to import dangling blobs as well (#87 (closed)).
Testing
The registry.datastore.Importer tool is tested using table-driven tests, fixtures, and golden files.
Fixtures
A sample repository was created in registry/datastore/testdata/fixtures/importer. It contains multiple repositories, created to simulate the relevant scenarios, including:
- docker/registry/v2/repositories/a-simple: a simple repository with only one manifest, one layer, and two tags pointing to the same manifest.
- docker/registry/v2/repositories/b-nested: a set of nested repositories that share some layers between them.
- docker/registry/v2/repositories/c-manifest-list: a repository with platform-specific manifests and an aggregating manifest list.
- docker/registry/v2/repositories/d-schema1: a repository with a deprecated schema 1 manifest.
- docker/registry/v2/repositories/e-helm: a repository with a Helm chart that also serves as a test for OCI manifests.
When running the tests, a filesystem storage driver is configured to use the testdata/fixtures/importer repository as its source.
For compactness, all layer blobs were truncated to a few kilobytes each; the whole repository is less than 500K in size.
Golden Files
Running the metadata import against the test repository generates a considerable amount of data in the database. It's neither practical nor maintainable to define structs for all expected rows and then compare them one by one.
The only thing we need to assert here is that the import process works as expected, meaning that once complete, the database tables look exactly like a set of manually pre-validated snapshots/dumps (in JSON format for easy readability), i.e. golden files.
Therefore, instead of comparing item by item, we simply save the expected database table contents to .golden files within registry/datastore/testdata/golden/TestImporter_Import, where files are named <table>.golden.
Test Flags
To facilitate the development process, two new flags for the go test command were added:
- update: Updates existing golden files with a new expected value. For example, if we change a column name, it's impractical to update the golden files manually. With this flag the golden files are automatically updated with a fresh dump that reflects the new column name.
- create: Creates missing golden files, then runs update. If we add new tables, new golden files need to be created for them. Instead of creating them manually, we can use this flag and they will be automatically created and populated with the current table content.
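A sketch of how such flags can be defined, using a standalone FlagSet so the parsing is testable in isolation (the flag names come from the MR; in the real tests they are registered on go test's flag set, so the wiring below is an assumption):

```go
package main

import "flag"

// parseGoldenFlags parses the two golden-file flags described above.
// A dedicated FlagSet keeps this sketch self-contained; the actual tests
// register the flags globally so `go test -update -create` picks them up.
func parseGoldenFlags(args []string) (create, update bool, err error) {
	fs := flag.NewFlagSet("golden", flag.ContinueOnError)
	fs.BoolVar(&create, "create", false, "create missing golden files, then update them")
	fs.BoolVar(&update, "update", false, "update existing golden files with a fresh dump")
	err = fs.Parse(args)
	return create, update, err
}
```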
Test Helpers
New helper functions were added to the registry.datastore.testutil package. These provide the logic required to handle the creation (createGoldenFile), update (updateGoldenFile) and reading (readGoldenFile) of golden files (all internal functions), as well as the comparison of a given value with the contents of a golden file (a public function, named CompareWithGoldenFile):
// CompareWithGoldenFile compares an actual value with the content of a .golden file. If requested, a missing golden
// file is automatically created and an outdated golden file automatically updated to match the actual content.
CompareWithGoldenFile(tb testing.TB, path string, actual []byte, create, update bool)
The caller (the specific test) is responsible for passing the values of the update and create go test flags to this function.
Table Driven Tests
Table-driven tests are used to tie everything together. Instead of multiple test functions, we use a single TestImporter_Import test with a sub-test for each table.
The test starts by building a test registry and storage driver, and then runs the Import method of the registry.datastore.Importer struct. This populates the database with the metadata of the fixture repository and should complete without errors.
Once done, we use table tests to validate each table, performing the following steps for each:
- Create a new sub-test, using the table name as the test name;
- Dump the corresponding table content as a JSON payload, using the PostgreSQL json_agg aggregate function;
- Compare the obtained dump with the corresponding golden file. These must match.
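The table list and the json_agg dump statement can be sketched as follows. The table names are taken from the test output below; the ORDER BY clause is an assumption added so dumps stay stable between runs, and the real test may build the query differently:

```go
package main

import "fmt"

// tables mirrors the per-table sub-tests of TestImporter_Import
// (names taken from the test output shown in this MR).
var tables = []string{
	"repositories", "manifest_configurations", "manifests",
	"repository_manifests", "layers", "manifest_layers",
	"manifest_lists", "manifest_list_items",
	"repository_manifest_lists", "tags",
}

// dumpQuery builds the PostgreSQL statement that dumps a whole table as
// a single JSON array via json_agg. The deterministic ORDER BY is an
// assumption so dumps compare stably against golden files.
func dumpQuery(table string) string {
	return fmt.Sprintf("SELECT json_agg(t) FROM (SELECT * FROM %s ORDER BY id) t", table)
}
```

Inside the test, each sub-test would run dumpQuery for its table and hand the resulting JSON to the golden-file comparison helper.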
Output
Success
A success output looks as follows:
--- PASS: TestImporter_Import (0.16s)
=== RUN TestImporter_Import/repositories
--- PASS: TestImporter_Import/repositories (0.00s)
=== RUN TestImporter_Import/manifest_configurations
--- PASS: TestImporter_Import/manifest_configurations (0.00s)
=== RUN TestImporter_Import/manifests
--- PASS: TestImporter_Import/manifests (0.00s)
=== RUN TestImporter_Import/repository_manifests
--- PASS: TestImporter_Import/repository_manifests (0.00s)
=== RUN TestImporter_Import/layers
--- PASS: TestImporter_Import/layers (0.00s)
=== RUN TestImporter_Import/manifest_layers
--- PASS: TestImporter_Import/manifest_layers (0.00s)
=== RUN TestImporter_Import/manifest_lists
--- PASS: TestImporter_Import/manifest_lists (0.00s)
=== RUN TestImporter_Import/manifest_list_items
--- PASS: TestImporter_Import/manifest_list_items (0.00s)
=== RUN TestImporter_Import/repository_manifest_lists
--- PASS: TestImporter_Import/repository_manifest_lists (0.00s)
=== RUN TestImporter_Import/tags
--- PASS: TestImporter_Import/tags (0.00s)
PASS
Golden File Mismatch
If the obtained dump and the golden file content don't match, a nice diff is presented:
Diff:
--- Expected
+++ Actual
@@ -1,3 +1,3 @@
[{"id":1,"name":"a-simple","path":"a-simple","parent_id":null,"created_at":"2020-04-15T12:04:28.95584","deleted_at":null},
- {"id":2,"name":"b-nestedddddddd","path":"b-nested","parent_id":null,"created_at":"2020-04-15T12:04:28.95584","deleted_at":null},
+ {"id":2,"name":"b-nested","path":"b-nested","parent_id":null,"created_at":"2020-04-15T12:04:28.95584","deleted_at":null},
{"id":3,"name":"older","path":"b-nested/older","parent_id":2,"created_at":"2020-04-15T12:04:28.95584","deleted_at":null},
Test: TestImporter_Import/repositories
Messages: does not match .golden file
--- FAIL: TestImporter_Import/repositories (0.00s)
Caveat
Some values in the JSON dumps vary between runs. For example, the created_at timestamp of each row differs between test runs (as rows are created on every run). To have reproducible comparisons against the golden files, we must override these varying attributes with a fixed value (one that matches the golden files).
Fortunately, this is only necessary for the created_at attribute of all entities and for two fields of signed schema 1 manifest payloads (the signature varies with every run). To make this possible, a utility overrideDynamicData function was created, inside which all varying fields are overridden using regular expressions.
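The regular-expression override can be sketched as below. The placeholder timestamp and the expression are assumptions for illustration, and the real helper also rewrites the schema 1 signature fields:

```go
package main

import "regexp"

// createdAtRe matches any created_at value in a JSON dump.
var createdAtRe = regexp.MustCompile(`"created_at":"[^"]*"`)

// overrideDynamicData sketches the helper described above: it rewrites
// run-dependent values in a JSON table dump to fixed placeholders so the
// dump can be compared against a golden file. The fixed timestamp is an
// arbitrary illustrative value.
func overrideDynamicData(dump []byte) []byte {
	return createdAtRe.ReplaceAll(dump, []byte(`"created_at":"2020-04-15T12:04:28.95584"`))
}
```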
Overall, the benefits of this approach largely outweigh this caveat.
Next Steps
This tool will be used to extrapolate database query rate and size requirements.
We will also look at implementing the improvements described above.