chore(datastore): add ability to elect DLB replica based on replication lag

João Pereira requested to merge 1306-3 into master

What does this MR do?

Related to DLB: Implement primary sticking (#1306 - closed). This MR adds a new UpToDateReplica method to the DB load balancer. The method checks a candidate replica's replication lag against the primary LSN previously recorded for a given repository (!1694 (merged)), in conformance with the spec.

How to test locally

  • Set up your local environment as described in docs/database-local-setup.md#fixed-hosts. Note that this MR also updates that doc with the relevant changes.

  • Tail your GDK PostgreSQL logs:

    $ gdk tail postgresql*
  • Apply the following patch:

    diff --git a/registry/handlers/app.go b/registry/handlers/app.go
    index caffff638..b9e99854d 100644
    --- a/registry/handlers/app.go
    +++ b/registry/handlers/app.go
    @@ -10,6 +10,7 @@ import (
        "errors"
        "expvar"
        "fmt"
    +	"github.com/docker/distribution/registry/datastore/models"
        "io"
        "math/rand"
        "net"
    @@ -388,13 +389,21 @@ func NewApp(ctx context.Context, config *configuration.Configuration) (*App, err
                    dbOpts = append(dbOpts, datastore.WithFixedHosts(hosts))
                }
            }
    -
    +		dbOpts = append(dbOpts, datastore.WithLSNCache(datastore.NewCentralRepositoryCache(app.redisCache)))
            db, err := datastore.NewDBLoadBalancer(ctx, dsn, dbOpts...)
            if err != nil {
                return nil, fmt.Errorf("failed to initialize database connections: %w", err)
            }
            startDBReplicaChecking(ctx, db)
    
    +		repo := &models.Repository{Path: "test/repo"}
    +		if err := db.RecordLSN(ctx, repo); err != nil {
    +			panic(err)
    +		}
    +		if err := db.UpToDateReplica(ctx, repo).DB.PingContext(ctx); err != nil {
    +			panic(err)
    +		}
    +
            // Skip postdeployment migrations to prevent pending post deployment
            // migrations from preventing the registry from starting.
            m := migrations.NewMigrator(db.Primary().DB, migrations.SkipPostDeployment)
  • Compile and start the registry.

  • Check the output of the GDK logs. You should see something like this:

    2024-07-31_18:54:33.53407 postgresql            : 2024-07-31 19:54:33.534 WEST [69684] LOG:  statement: SELECT pg_current_wal_insert_lsn()
    2024-07-31_18:54:33.63855 postgresql-replica-2  : 2024-07-31 19:54:33.638 WEST [69666] LOG:  statement:
    2024-07-31_18:54:33.63858 postgresql-replica-2  :               WITH replica_lsn AS (
    2024-07-31_18:54:33.63859 postgresql-replica-2  :                               SELECT pg_last_wal_replay_lsn () AS lsn
    2024-07-31_18:54:33.63860 postgresql-replica-2  :                       )
    2024-07-31_18:54:33.63861 postgresql-replica-2  :                       SELECT
    2024-07-31_18:54:33.63861 postgresql-replica-2  :                               pg_wal_lsn_diff ( '0/59CE00F0' ::pg_lsn, lsn) <= 0
    2024-07-31_18:54:33.63861 postgresql-replica-2  :                       FROM
    2024-07-31_18:54:33.63862 postgresql-replica-2  :                               replica_lsn
    2024-07-31_18:54:33.65163 postgresql-replica-2  : 2024-07-31 19:54:33.651 WEST [69666] LOG:  statement: -- ping
  • If you have redis-cli installed, you can also double-check the key there:

    redis-cli -s /<full path to gdk root>/redis/redis.socket
    redis /<full path to gdk root>/redis/redis.socket> KEYS "registry:*"
    1) "registry:db:{repository:test:c3ecf330c6173bf445635647db26f09843444527b55b3a0f5d5223d64045d378}:lsn"
    redis /<full path to gdk root>/redis/redis.socket> GET "registry:db:{repository:test:c3ecf330c6173bf445635647db26f09843444527b55b3a0f5d5223d64045d378}:lsn"
    "0/59CE00F0"

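For reviewers who want to reason about the lag check seen in the replica log without a database at hand, the comparison `pg_wal_lsn_diff(primary_lsn, replay_lsn) <= 0` can be approximated in plain Go. This is a sketch only: `parseLSN` and `replicaCaughtUp` are hypothetical helpers, not functions from this MR, and they rely on PostgreSQL's textual `pg_lsn` format of two hexadecimal numbers separated by a slash.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts a textual PostgreSQL LSN such as "0/59CE00F0" into a
// single 64-bit WAL position (high 32 bits before the slash, low 32 after).
func parseLSN(s string) (uint64, error) {
	hi, lo, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("invalid LSN %q: missing '/'", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	return h<<32 | l, nil
}

// replicaCaughtUp mirrors the logged condition
// pg_wal_lsn_diff(primaryLSN, replayLSN) <= 0: the replica is considered up
// to date once its replay position has reached the recorded primary position.
func replicaCaughtUp(primaryLSN, replayLSN string) (bool, error) {
	p, err := parseLSN(primaryLSN)
	if err != nil {
		return false, err
	}
	r, err := parseLSN(replayLSN)
	if err != nil {
		return false, err
	}
	return p <= r, nil
}

func main() {
	// The recorded primary LSN is the one from the logs above; the replay
	// positions are made-up values for illustration.
	for _, replay := range []string{"0/59CE0000", "0/59CE00F0", "1/00000000"} {
		ok, err := replicaCaughtUp("0/59CE00F0", replay)
		fmt.Println(replay, ok, err)
	}
}
```

In the registry itself the comparison is delegated to PostgreSQL, as the logged statement shows; the sketch is only meant to make the semantics of the `<= 0` check explicit.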
Author checklist

  • Feature flags
    • Added feature flag:
    • This feature does not require a feature flag
  • I added unit tests or they are not required
  • I added documentation (or it's not required)
  • I followed code review guidelines
  • I followed Go Style guidelines
  • For database changes including schema migrations:
    • Manually run up and down migrations in a postgres.ai production database clone and post a screenshot of the result here.
    • If adding new queries, extract a query plan from postgres.ai and post the link here. If changing existing queries, also extract a query plan for the current version for comparison.
      • I do not have access to postgres.ai and have made a comment on this MR asking for these to be run on my behalf.
    • Do not include code that depends on the schema migrations in the same commit. Split the MR into two or more.
  • Ensured this change is safe to deploy to individual stages in the same environment (cny -> prod). State-related changes can be troublesome due to having parts of the fleet processing (possibly related) requests in different ways.

Reviewer checklist

  • Ensure the commit and MR title are still accurate.
  • If the change contains a breaking change, apply the breaking change label.
  • If the change is considered high risk, apply the label high-risk-change.
  • Identify if the change can be rolled back safely. (note: all other reasons for not being able to rollback will be sufficiently captured by major version changes).

If the MR introduces database schema migrations:

  • Ensure the commit and MR title start with fix:, feat:, or perf: so that the change appears in the Changelog.

If the changes cannot be rolled back, follow these steps:

  • Apply the label cannot-rollback.
  • Add a section to the MR description that includes the following details:
    • The reasoning behind why a release containing the presented MR cannot be rolled back (e.g. schema migrations or changes to the FS structure).
    • Detailed steps to revert/disable a feature introduced by the same change where a migration cannot be rolled back. (note: ideally MRs containing schema migrations should not contain feature changes.)
    • Ensure this MR does not add code that depends on these changes that cannot be rolled back.

Related to #1306 (closed)

Edited by João Pereira
