Significant noop increments in PK sequences for top_level_namespaces and repositories tables

Context

The container registry has a few API operations that are prone to race conditions. The main two are blob and manifest uploads. When pushing an image to the container registry, Docker will upload blobs in parallel. Similarly, multiple clients may attempt to upload the same manifest as well.

This requires us to protect against race conditions when these operations target new top-level namespaces (top_level_namespaces table) and/or repositories (repositories table).

If one client thread is uploading blob B1 to the top-level namespace N and repository R, while another thread is uploading blob B2 to the same N and R, then it is possible that when using a regular "find or create" approach, one of the threads will manage to create N and/or R before the other, which can lead to a "unique constraint violation" when the second one attempts to execute the "create".

To protect against this, when reviewing the database schema and queries with the Database team (#104 (closed)), we decided to go ahead with the most simple approach that would allow us to avoid such race conditions - using an INSERT INTO ... ON CONFLICT DO NOTHING statement to emulate a "create or find" approach, as described in the documentation.

Problem

The "create or find" approach and the underlying INSERT INTO ... ON CONFLICT DO NOTHING statement works great to avoid race conditions, but they come with a visible downside: the primary key sequence for these tables is incremented every time we try to create a record that already exists.

This is especially clear for the registry because:

Each upload is composed of N parts (blobs), so the ON CONFLICT DO NOTHING clause gets activated N-1 times, incrementing the sequence by the same amount;
Top-level namespaces and repositories are rarely created but frequently updated. This means that for the vast majority of the time we're triggering the ON CONFLICT DO NOTHING because the target records already exist;
Repositories have a hierarchical relationship (e.g. foo/bar), and the "create or find" is used for each path segment, which makes this caveat exponentially worse.

The production workload/rate has now made this very obvious. Although we're using a bigint for these incremental id columns, it should be safe to optimize this and avoid other potential downsides of triggering the ON CONCLIFT DO NOTHING clauses (such as dead tuples).

Additionally, I see that a couple of months ago we started recommending a "safe find or create" approach instead of the atomic "create or find" (docs).

This led me to chase an alternative now.

Example

Using the database locally we can pull an image from DockerHub, tag it, and push it to the registry multiple times using different target top-level namespaces (ns1, ns2 and ns3 in this example):

docker pull redis:6.2.6

docker tag redis:6.2.6 0.0.0.0:5000/ns1:latest
docker push 0.0.0.0:5000/ns1:latest
The push refers to repository [0.0.0.0:5000/ns1]
146262eb3841: Pushed 
0bd13b42de4d: Pushed 
6b01cc47a390: Pushed 
8b9770153666: Pushed 
b43651130521: Pushed 
e8b689711f21: Pushed 
latest: digest: sha256:5d30f5c16e473549ad7c950b0ac3083039719b1c9749519c50e18017dd4bfc54 size: 1573

docker tag redis:6.2.6 0.0.0.0:5000/ns2:latest
docker push 0.0.0.0:5000/ns2:latest

docker tag redis:6.2.6 0.0.0.0:5000/ns3:latest
docker push 0.0.0.0:5000/ns3:latest

Note that this image has 6 layers and 1 manifest, which means the "create or find" query will be used 7 times during the upload. Looking at the database:

localhost postgres@registry=# select * from top_level_namespaces;
 id |          created_at           | updated_at |  name  
----+-------------------------------+------------+--------
  1 | 2021-11-08 14:17:55.090007+00 | [NULL]     | ns1
  8 | 2021-11-08 14:18:15.31768+00  | [NULL]     | ns2
 15 | 2021-11-08 14:18:48.529663+00 | [NULL]     | ns3

We can see that the id sequence was incremented by 7 between each docker push.

Solution

We know the following in the context of the container registry:

The vast majority of requests (+90%) target existing top-level namespaces and repositories;
We never delete top-level namespaces or repository records from the database. So once created they are guaranteed to be there.

Considering the above, it seems safe to swap the "create or find" approach for a "safe find or create". For this, instead of doing a single atomic query - INSERT INTO ... ON CONFLICT DO NOTHING, we can first search for the record and only execute the upsert if it does not yet exist.

Edited Nov 09, 2021 by João Pereira