Significant noop increments in PK sequences for top_level_namespaces and repositories tables
Context
The container registry has a few API operations that are prone to race conditions. The main two are blob and manifest uploads. When pushing an image to the container registry, Docker will upload blobs in parallel. Similarly, multiple clients may attempt to upload the same manifest as well.
This requires us to protect against race conditions when these operations target new top-level namespaces (top_level_namespaces
table) and/or repositories (repositories
table).
If one client thread is uploading blob B1
to the top-level namespace N
and repository R
, while another thread is uploading blob B2
to the same N
and R
, then it is possible that when using a regular "find or create" approach, one of the threads will manage to create N
and/or R
before the other, which can lead to a "unique constraint violation" when the second one attempts to execute the "create".
To protect against this, when reviewing the database schema and queries with the Database team (#104 (closed)), we decided to go ahead with the most simple approach that would allow us to avoid such race conditions - using an INSERT INTO ... ON CONFLICT DO NOTHING
statement to emulate a "create or find" approach, as described in the documentation.
Problem
The "create or find" approach and the underlying INSERT INTO ... ON CONFLICT DO NOTHING
statement works great to avoid race conditions, but they come with a visible downside: the primary key sequence for these tables is incremented every time we try to create a record that already exists.
This is especially clear for the registry because:
-
Each upload is composed of
N
parts (blobs), so theON CONFLICT DO NOTHING
clause gets activatedN-1
times, incrementing the sequence by the same amount; -
Top-level namespaces and repositories are rarely created but frequently updated. This means that for the vast majority of the time we're triggering the
ON CONFLICT DO NOTHING
because the target records already exist; -
Repositories have a hierarchical relationship (e.g.
foo/bar
), and the "create or find" is used for each path segment, which makes this caveat exponentially worse.
The production workload/rate has now made this very obvious. Although we're using a bigint
for these incremental id
columns, it should be safe to optimize this and avoid other potential downsides of triggering the ON CONCLIFT DO NOTHING
clauses (such as dead tuples).
Additionally, I see that a couple of months ago we started recommending a "safe find or create" approach instead of the atomic "create or find" (docs).
This led me to chase an alternative now.
Example
Using the database locally we can pull an image from DockerHub, tag it, and push it to the registry multiple times using different target top-level namespaces (ns1
, ns2
and ns3
in this example):
docker pull redis:6.2.6
docker tag redis:6.2.6 0.0.0.0:5000/ns1:latest
docker push 0.0.0.0:5000/ns1:latest
The push refers to repository [0.0.0.0:5000/ns1]
146262eb3841: Pushed
0bd13b42de4d: Pushed
6b01cc47a390: Pushed
8b9770153666: Pushed
b43651130521: Pushed
e8b689711f21: Pushed
latest: digest: sha256:5d30f5c16e473549ad7c950b0ac3083039719b1c9749519c50e18017dd4bfc54 size: 1573
docker tag redis:6.2.6 0.0.0.0:5000/ns2:latest
docker push 0.0.0.0:5000/ns2:latest
docker tag redis:6.2.6 0.0.0.0:5000/ns3:latest
docker push 0.0.0.0:5000/ns3:latest
Note that this image has 6 layers and 1 manifest, which means the "create or find" query will be used 7 times during the upload. Looking at the database:
localhost postgres@registry=# select * from top_level_namespaces;
id | created_at | updated_at | name
----+-------------------------------+------------+--------
1 | 2021-11-08 14:17:55.090007+00 | [NULL] | ns1
8 | 2021-11-08 14:18:15.31768+00 | [NULL] | ns2
15 | 2021-11-08 14:18:48.529663+00 | [NULL] | ns3
We can see that the id
sequence was incremented by 7 between each docker push
.
Solution
We know the following in the context of the container registry:
- The vast majority of requests (+90%) target existing top-level namespaces and repositories;
- We never delete top-level namespaces or repository records from the database. So once created they are guaranteed to be there.
Considering the above, it seems safe to swap the "create or find" approach for a "safe find or create". For this, instead of doing a single atomic query - INSERT INTO ... ON CONFLICT DO NOTHING
, we can first search for the record and only execute the upsert if it does not yet exist.