Accurate approach for getting out of date repositories
The current approach for determining outdated repositories from the replication jobs is simplistic. We check whether the latest replication job is from the previous writable primary and whether it is in the 'completed' state. This has multiple downsides, though:
- A node is considered outdated if it was reconciled from any node other than the last writable primary. This causes `dataloss` to report unreplicated writes even if the source node was up to date. This composes badly with `reconcile`, as reconcile only schedules replication jobs if the node is inconsistent. If a node was brought up to date from a secondary, reconciling later from the last writable primary to the node does not mark it as consistent in `dataloss`, as no replication job is created.
- If the repository has not received any writes after a second failover, the last replication job won't be from the previous writable primary but from the one before that. The current approach also reports this as outdated. This over-reporting can't be fixed by reconciling, as no replication jobs are produced since the repositories are actually up to date.
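To make the two false positives concrete, here is a minimal Python sketch of the heuristic described above. The `ReplicationJob` type, the `reported_outdated` helper, the storage names, and the handling of a node without any jobs are all assumptions made for illustration; this is not Praefect's actual code.

```python
from dataclasses import dataclass


@dataclass
class ReplicationJob:
    source: str  # storage the job replicated from
    target: str  # storage the job replicated to
    state: str   # "completed", "failed", ...


def reported_outdated(jobs, node, previous_primary):
    """Heuristic from the description: a node counts as up to date only if its
    latest job came from the previous writable primary and completed."""
    node_jobs = [job for job in jobs if job.target == node]
    if not node_jobs:
        return True  # assumption: no job history is treated as outdated
    latest = node_jobs[-1]  # jobs are assumed to be ordered oldest first
    return not (latest.source == previous_primary and latest.state == "completed")


# Problem 1: gitaly-3 missed a write from gitaly-1 and was then brought up
# to date from the secondary gitaly-2.
reconciled_from_secondary = [
    ReplicationJob("gitaly-1", "gitaly-2", "completed"),
    ReplicationJob("gitaly-1", "gitaly-3", "failed"),
    ReplicationJob("gitaly-2", "gitaly-3", "completed"),
]

# Problem 2: failover from gitaly-1 to gitaly-2, no writes, then a second
# failover to gitaly-3. The previous writable primary is now gitaly-2.
no_writes_between_failovers = [
    ReplicationJob("gitaly-1", "gitaly-2", "completed"),
    ReplicationJob("gitaly-1", "gitaly-3", "completed"),
]

print(reported_outdated(reconciled_from_secondary, "gitaly-3", "gitaly-1"))    # True
print(reported_outdated(no_writes_between_failovers, "gitaly-3", "gitaly-2"))  # True
```

In both scenarios gitaly-3 holds the latest data, yet the heuristic flags it because it only looks at where the most recent job came from.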
@zj-gitlab @pokstad1 I think these problems hinder the usefulness of the `dataloss` command quite a bit. This will also cause problems for #2717 (closed), as we can't schedule replication jobs from the secondaries without knowing they are up to date. In !2256 (closed) I had a proof of concept that follows the replication jobs back to the last writable primary to figure out whether a node was brought up to date from a secondary. While it solves problem 1, I realized it doesn't solve problem 2, as the previous writable primary would not be in the chain if there were no further writes before the second failover.
A fairly straightforward way to solve both of these problems could be to store an identifier for each write. To do this, we need a table where we can store the version identifier per repository.
CREATE TABLE repository_versions (
    virtual_storage TEXT NOT NULL,
    storage TEXT NOT NULL,
    relative_path TEXT NOT NULL,
    version BIGINT NOT NULL, -- incremented on each write
    PRIMARY KEY (virtual_storage, storage, relative_path) -- one row per repository replica
);
When a primary receives a write and we create the replication jobs, we'll update the primary's entry in the table and also include the version identifier in the replication jobs. When the replication jobs are acknowledged as completed, we'll update the target node's entry in the table to match the version in the replication job. Completing a replication job guarantees the repository is at least at the version of the write that produced the job. The repository might actually be at a newer version if there were new writes to the primary in the meantime and the replication also pulled those changes in. However, this inconsistency is resolved later, when the replication job for the newer version is applied.
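The write and acknowledgement steps could look roughly like the sketch below. It uses Python's `sqlite3` with an in-memory database as a stand-in for the real database, so the upsert statements are SQLite syntax and a PostgreSQL version may need adjusting. The `record_write` and `complete_replication` helpers, the job dictionary layout, and the storage names are assumptions for illustration only.

```python
import sqlite3

SCHEMA = """
CREATE TABLE repository_versions (
    virtual_storage TEXT NOT NULL,
    storage         TEXT NOT NULL,
    relative_path   TEXT NOT NULL,
    version         BIGINT NOT NULL,
    PRIMARY KEY (virtual_storage, storage, relative_path)
)"""

KEY = "virtual_storage = ? AND storage = ? AND relative_path = ?"


def record_write(db, virtual_storage, primary, relative_path, secondaries):
    """Bump the primary's version and return one replication job per secondary."""
    key = (virtual_storage, primary, relative_path)
    # SQLite upsert syntax; the first write creates the row at version 1.
    db.execute(
        "INSERT INTO repository_versions VALUES (?, ?, ?, 1) "
        "ON CONFLICT (virtual_storage, storage, relative_path) "
        "DO UPDATE SET version = version + 1",
        key,
    )
    (version,) = db.execute(
        f"SELECT version FROM repository_versions WHERE {KEY}", key
    ).fetchone()
    return [
        {"virtual_storage": virtual_storage, "target": target,
         "relative_path": relative_path, "version": version}
        for target in secondaries
    ]


def complete_replication(db, job):
    """On acknowledgement, set the target's version to the one the job carries."""
    db.execute(
        "INSERT INTO repository_versions "
        "VALUES (:virtual_storage, :target, :relative_path, :version) "
        "ON CONFLICT (virtual_storage, storage, relative_path) "
        "DO UPDATE SET version = excluded.version",
        job,
    )


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
jobs = record_write(db, "default", "gitaly-1", "repo.git", ["gitaly-2", "gitaly-3"])
complete_replication(db, jobs[0])  # only gitaly-2 acknowledges its job
```

After this run, gitaly-1 and gitaly-2 are both recorded at version 1, while gitaly-3 has no row yet because its job was never acknowledged. The sketch follows the description literally and overwrites the target's version with the job's version, so it assumes jobs for a given target are acknowledged in order.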
Reconcile should also take the current version from the database and add it to the replication jobs it creates in order to propagate the version.
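As a rough illustration of the reconcile change, the sketch below reads the source's recorded version and attaches it to every job it creates. The `versions` dictionary stands in for the `repository_versions` table, and `reconcile_jobs`, the job layout, and the storage names are hypothetical. The consistency check that decides whether a job is needed at all is left out.

```python
def reconcile_jobs(versions, virtual_storage, source, targets, relative_path):
    """Create reconciliation jobs that carry the source's current version."""
    version = versions.get((virtual_storage, source, relative_path))
    if version is None:
        # Assumption: with no recorded version there is nothing to propagate.
        return []
    return [
        {"virtual_storage": virtual_storage, "source": source, "target": target,
         "relative_path": relative_path, "version": version}
        for target in targets
    ]


# Keys are (virtual_storage, storage, relative_path), mirroring the table's key.
versions = {
    ("default", "gitaly-2", "repo.git"): 4,
    ("default", "gitaly-3", "repo.git"): 3,
}
jobs = reconcile_jobs(versions, "default", "gitaly-2", ["gitaly-3"], "repo.git")
```

When such a job is acknowledged, the target would be recorded at version 4, regardless of whether the source was the primary or a secondary.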
As new writes produce a new version and replication jobs propagate a version, getting the information about outdated repositories becomes simple. After a failover, we can simply query this table and find repositories which do not have the latest version of the repository. Since the version is always increasing, we can drop the knowledge of the previous writable primary and simply use the highest version. The previous writable primary would always have the highest version number (unless reconciled to and overwritten).
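One way the lookup could be phrased is sketched below, again with `sqlite3` standing in for the real database. The query, the sample rows, and the storage names are assumptions for illustration.

```python
import sqlite3

# Every replica whose version is behind the highest version recorded for the
# same repository within the same virtual storage.
OUTDATED_QUERY = """
SELECT rv.virtual_storage, rv.relative_path, rv.storage,
       rv.version, latest.version AS latest_version
FROM repository_versions AS rv
JOIN (
    SELECT virtual_storage, relative_path, MAX(version) AS version
    FROM repository_versions
    GROUP BY virtual_storage, relative_path
) AS latest
  ON latest.virtual_storage = rv.virtual_storage
 AND latest.relative_path = rv.relative_path
WHERE rv.version < latest.version
ORDER BY rv.virtual_storage, rv.relative_path, rv.storage
"""

db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE repository_versions (
    virtual_storage TEXT NOT NULL,
    storage         TEXT NOT NULL,
    relative_path   TEXT NOT NULL,
    version         BIGINT NOT NULL,
    PRIMARY KEY (virtual_storage, storage, relative_path)
)""")
db.executemany(
    "INSERT INTO repository_versions VALUES (?, ?, ?, ?)",
    [
        ("default", "gitaly-1", "a.git", 3),
        ("default", "gitaly-2", "a.git", 3),
        ("default", "gitaly-3", "a.git", 2),  # missed the last write
        ("default", "gitaly-1", "b.git", 1),
        ("default", "gitaly-2", "b.git", 1),
    ],
)

outdated = db.execute(OUTDATED_QUERY).fetchall()
print(outdated)  # [('default', 'a.git', 'gitaly-3', 2, 3)]
```

A storage that has no row at all for a repository does not show up in this query. Detecting those replicas would require joining against the list of configured storages, which this sketch does not model.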
This also supports #2862 (closed), as we'll have a version per repository. We can simply check whether the new primary is on the highest version and, if not, reject writes until the data loss is acknowledged or fixed.
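A minimal sketch of that check is shown below, with `sqlite3` standing in for the real database. The `accepts_writes` helper and the storage names are hypothetical, and acknowledging the data loss is not modeled.

```python
import sqlite3


def accepts_writes(db, virtual_storage, primary, relative_path):
    """Return True only if the primary holds the highest recorded version."""
    (highest,) = db.execute(
        "SELECT MAX(version) FROM repository_versions "
        "WHERE virtual_storage = ? AND relative_path = ?",
        (virtual_storage, relative_path),
    ).fetchone()
    if highest is None:
        # Assumption: a repository without recorded writes has nothing to lose.
        return True
    row = db.execute(
        "SELECT version FROM repository_versions "
        "WHERE virtual_storage = ? AND storage = ? AND relative_path = ?",
        (virtual_storage, primary, relative_path),
    ).fetchone()
    return row is not None and row[0] == highest


db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE repository_versions (
    virtual_storage TEXT NOT NULL,
    storage         TEXT NOT NULL,
    relative_path   TEXT NOT NULL,
    version         BIGINT NOT NULL,
    PRIMARY KEY (virtual_storage, storage, relative_path)
)""")
db.executemany(
    "INSERT INTO repository_versions VALUES (?, ?, ?, ?)",
    [("default", "gitaly-1", "repo.git", 3), ("default", "gitaly-2", "repo.git", 2)],
)

print(accepts_writes(db, "default", "gitaly-1", "repo.git"))  # True
print(accepts_writes(db, "default", "gitaly-2", "repo.git"))  # False
```

If gitaly-2 were promoted here, writes to `repo.git` would be rejected until it is brought up to version 3 or the loss is acknowledged.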