Retry service discovery when it fails
What does this MR do?
This MR attempts to solve transient issues with postgres service discovery, particularly during the startup of new puma workers.
Before this MR, when service discovery either failed or returned an empty response, the puma worker would run for 60 seconds before attempting service discovery again, leading to 60-second spikes of no_secondaries_available
warning messages (for example, here).
Since we have very few actual service discovery errors (see logging here), simply retrying in the case of an error seems like a reasonable solution.
This also changes empty responses from the nameserver to errors (which then trigger the retry logic), since we expect to always have secondaries available (currently 24 in production) and I believe we rarely incorrectly get empty responses from the nameserver, and want to retry service discovery in that case.
Relates to #323726 (closed).
Screenshots or Screencasts (strongly suggested)
How to setup and validate locally (strongly suggested)
There are a few steps required to run service discovery locally. This is the easiest way I've found.
- Install
dnsmasq
to run a local dns nameserver.-
brew install dnsmasq
(or your package manager of choice)
-
- Start dnsmasq on port 53 (the default)
-
sudo brew services start dnsmasq
(needs sudo because of the privileged port)
-
- Verify that
dnsmasq
is working-
dig @localhost localhost +short
should return127.0.0.1
-
- Stop your local gdk postgresql
gdk stop postgresql
- Run postgresql directly so that it opens a tcp port.
cd your-gdk-dir/postgresql/data && pg_ctl start -D .
- Add the following production entry to your
database.yml
(this problem is only reproducible with RAILS_ENV=production)
production:
main:
adapter: postgresql
encoding: unicode
database: gitlabhq_development
host: localhost
port: 5432
pool: 10
prepared_statements: false
variables:
statement_timeout: 120s
load_balancing:
discover:
nameserver: localhost
port: 53
record: localhost
record_type: A
interval: 60
disconnect_timeout: 120
It's difficult to inject a failure in a local dns nameserver, so my solution when testing this was to insert a breakpoint and disable or re-enable my nameserver between loop iterations.
Add the following line:
diff --git a/lib/gitlab/database/load_balancing/service_discovery.rb b/lib/gitlab/database/load_balancing/service_discovery.rb
index d22134379da..e27e81f4a71 100644
--- a/lib/gitlab/database/load_balancing/service_discovery.rb
+++ b/lib/gitlab/database/load_balancing/service_discovery.rb
@@ -92,6 +92,7 @@ def perform_service_discovery
Gitlab::AppLogger.error(
"Service discovery encountered an error: #{error.message}"
)
+ binding.pry
# Slightly randomize the retry delay so that, in the case of a total
# dns outage, all starting services do not pressure the dns server at the same time.
In one terminal, disable dnsmasq: sudo brew services stop dnsmasq
In another, run rails: env RAILS_ENV=production ENABLE_BOOTSNAP=true bundle exec puma
Wait for the breakpoint to be reached in the rescue
block.
If you would like another failure to happen, simply continue
in pry.
If you would like to recover, sudo brew services start dnsmasq
before continuing.
Note that if you make service discovery fail 3 times here, puma will proceed to start worker threads, and they will all start running service discovery. You will then hit the pry breakpoint in multiple threads at once, and your debug session will be a mess and difficult to use.
Does this MR meet the acceptance criteria?
Conformity
-
I have included changelog trailers, or none are needed. (Does this MR need a changelog?) -
I have added/updated documentation, or it's not needed. (Is documentation required?) -
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?) -
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides. -
This change is backwards compatible across updates, or this does not apply.
Availability and Testing
-
I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.) -
I have tested this MR in all supported browsers, or it's not needed. -
I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.