Ensure service discovery runs before secondaries are used
What does this MR do?
We use service discovery to dynamically find the ip addresses to use for postgres secondaries.
Before this change, when Gitlab::Database::LoadBalancing.start_service_discovery
ran in config/initializers/load_balancing.rb, it started the first round of service discovery in a background thread. This created an opportunity for initialization to continue, including starting database requests that should use a secondary, before the list of secondaries was populated.
As a result, we see many instances of the no_secondaries_available
warning here.
Here is an example log search showing a burst of these queries during application startup for a specific instance: https://log.gprd.gitlab.net/goto/4a4e74e81ed277cbe567da400f284952.
To stop this race condition, this MR simply runs the initial round of service discovery before spawning the worker thread, ensuring that the hosts list is populated before application startup continues.
Relates to #323726 (closed)
Screenshots or Screencasts (strongly suggested)
Before making this change, I ran rails with RAILS_ENV=test
and added a log line where the hosts are loaded, along with a 0.2 second sleep before the first service discovery call (because my local dns is extremely fast).
I saw the following (note that the set of queries that run before service discovery starts have always run against the primary and won't cause any errors, since service discovery hasn't been set up yet):
test.log file showing race condition
(0.2ms) SELECT VERSION() /*application:test*/ ↳ lib/gitlab/database/connection.rb:104:in `database_version' License Load (0.5ms) SELECT "licenses".* FROM "licenses" ORDER BY "licenses"."id" DESC LIMIT 100 /*application:test*/ ↳ lib/gitlab/json_cache.rb:51:in `fetch' ApplicationSetting Load (2.7ms) SELECT "application_settings".* FROM "application_settings" ORDER BY "application_settings"."id" DESC LIMIT 1 /*application:test*/ ↳ app/models/concerns/cacheable_attributes.rb:19:in `current_without_cache' Creating scope :group_view_details. Overwriting existing method User.group_view_details. (1.3ms) SELECT "schema_migrations"."version" FROM "schema_migrations" ORDER BY "schema_migrations"."version" ASC /*application:test*/ ↳ config/initializers/fill_shards.rb:6:in `' GeoNode Exists? (0.8ms) SELECT 1 AS one FROM "geo_nodes" LIMIT 1 /*application:test*/ ↳ ee/lib/gitlab/geo.rb:49:in `block in enabled?' (0.7ms) SELECT "shards"."name" FROM "shards" /*application:test*/ ↳ app/models/shard.rb:12:in `populate!' ******* Here is where service discovery starts ****** Gitlab::Database::PostgresPartition Load (4.6ms) SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'audit_events')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/ ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' Feature::FlipperGate Load (0.7ms) SELECT "feature_gates".* FROM "feature_gates" WHERE "feature_gates"."feature_key" = 'partition_pruning_dry_run' /*application:test*/ ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' Gitlab::Database::PostgresPartition Load (2.4ms) SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'audit_events')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/ ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' Gitlab::Database::PostgresPartition Load (3.0ms) SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'web_hook_logs')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/ ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' Gitlab::Database::PostgresPartition Load (2.4ms) SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'web_hook_logs')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/ ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' Gitlab::Database::PostgresPartition Load (2.7ms) SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'incident_management_pending_alert_escalations')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/ ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' Gitlab::Database::PostgresPartition Load (2.2ms) SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'incident_management_pending_alert_escalations')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/ ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' Raven 3.1.2 configured not to capture errors: DSN not set ********* Here is where service discovery completes ******** REPLACING HOSTS Creating scope :without_statuses. Overwriting existing method CommitStatus.without_statuses. Creating scope :in_pipelines. Overwriting existing method Ci::Build.in_pipelines.
After making this change, I repeated the same procedure, to show that the race condition is fixed:
test.log file showing race condition fixed
(0.2ms) SELECT VERSION() /*application:test*/ ↳ lib/gitlab/database/connection.rb:104:in `database_version' License Load (0.4ms) SELECT "licenses".* FROM "licenses" ORDER BY "licenses"."id" DESC LIMIT 100 /*application:test*/ ↳ lib/gitlab/json_cache.rb:51:in `fetch' ApplicationSetting Load (3.1ms) SELECT "application_settings".* FROM "application_settings" ORDER BY "application_settings"."id" DESC LIMIT 1 /*application:test*/ ↳ app/models/concerns/cacheable_attributes.rb:19:in `current_without_cache' Creating scope :group_view_details. Overwriting existing method User.group_view_details. (2.0ms) SELECT "schema_migrations"."version" FROM "schema_migrations" ORDER BY "schema_migrations"."version" ASC /*application:test*/ ↳ config/initializers/fill_shards.rb:6:in `' GeoNode Exists? (0.8ms) SELECT 1 AS one FROM "geo_nodes" LIMIT 1 /*application:test*/ ↳ ee/lib/gitlab/geo.rb:49:in `block in enabled?' (0.7ms) SELECT "shards"."name" FROM "shards" /*application:test*/ ↳ app/models/shard.rb:12:in `populate!' *********** Here service discovery both starts and completes (since it runs in the main thread) ************** REPLACING HOSTS Gitlab::Database::PostgresPartition Load (4.9ms) SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'audit_events')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/ ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' Feature::FlipperGate Load (0.8ms) SELECT "feature_gates".* FROM "feature_gates" WHERE "feature_gates"."feature_key" = 'partition_pruning_dry_run' /*application:test*/ ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' CACHE Gitlab::Database::PostgresPartition Load (0.0ms) SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'audit_events')) ORDER BY "postgres_partitions"."name" ASC ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' Gitlab::Database::PostgresPartition Load (2.9ms) SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'web_hook_logs')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/ ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' CACHE Gitlab::Database::PostgresPartition Load (0.0ms) SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'web_hook_logs')) ORDER BY "postgres_partitions"."name" ASC ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' Gitlab::Database::PostgresPartition Load (2.6ms) SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'incident_management_pending_alert_escalations')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/ ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' CACHE Gitlab::Database::PostgresPartition Load (0.0ms) SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'incident_management_pending_alert_escalations')) ORDER BY "postgres_partitions"."name" ASC ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer' Raven 3.1.2 configured not to capture errors: DSN not set Creating scope :without_statuses. Overwriting existing method CommitStatus.without_statuses. Creating scope :in_pipelines. Overwriting existing method Ci::Build.in_pipelines.
How to setup and validate locally (strongly suggested)
Enabling service discovery locally is a bit complicated. I took the following steps:
-
brew install dnsmasq
(dnsmasq is a local dns nameserver) -
sudo brew services start dnsmasq
to start dnsmasq onlocalhost:53
-
verify with
dig @localhost localhost +short
, which should return 127.0.0.1 -
Stop postgresql with
gdk stop postgresql
-
restart postgresql so it exposes a tcp port:
cd <gdk-root>/postgresql/data && pg_ctl start -D .
-
Change the
development
entry inconfig/database.yml
to the following (comment out what was there before):
config/database.yml
development: main: adapter: postgresql encoding: unicode database: gitlabhq_development host: localhost port: 5432 pool: 10 prepared_statements: false variables: statement_timeout: 120s load_balancing: discover: nameserver: localhost record: localhost # We tell gitlab to use the same database as a secondary, but it doesn't know that. record_type: A port: 53 interval: 60 disconnect_timeout: 120
- Do any testing you want to do
To undo these changes and get your system back to the way it started, you can do the following:
-
sudo brew services stop dnsmasq && brew uninstall dnsmasq
-
cd <gdk-root>/postgresql/data && pg_ctl stop -D .
-
gdk start postgresql
-
revert the changes to your
config/database.yml
Does this MR meet the acceptance criteria?
Conformity
-
I have included changelog trailers, or none are needed. (Does this MR need a changelog?) -
I have added/updated documentation, or it's not needed. (Is documentation required?) -
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?) -
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides. -
This change is backwards compatible across updates, or this does not apply.
Availability and Testing
Since this change only moves a single round of service discovery to a different thread, I believe it is relatively safe.
-
I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.) -
I have tested this MR in all supported browsers, or it's not needed. -
I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.