Retry service discovery when it fails (!68672) · Merge requests · GitLab.org / GitLab

Simon Tomlinson requested to merge stomlinson/retry-service-discovery-on-failure into master Aug 20, 2021

What does this MR do?

This MR attempts to solve transient issues with postgres service discovery, particularly during the startup of new puma workers.

Before this MR, when service discovery either failed or returned an empty response, the puma worker would run for 60 seconds before attempting service discovery again, leading to 60-second spikes of no_secondaries_available warning messages (for example, here).

Since we have very few actual service discovery errors (see logging here), simply retrying in the case of an error seems like a reasonable solution.

This also changes empty responses from the nameserver to errors (which then trigger the retry logic), since we expect to always have secondaries available (currently 24 in production) and I believe we rarely incorrectly get empty responses from the nameserver, and want to retry service discovery in that case.

Relates to #323726 (closed).

Screenshots or Screencasts (strongly suggested)

How to setup and validate locally (strongly suggested)

There are a few steps required to run service discovery locally. This is the easiest way I've found.

Install dnsmasq to run a local dns nameserver.
- brew install dnsmasq (or your package manager of choice)
Start dnsmasq on port 53 (the default)
- sudo brew services start dnsmasq (needs sudo because of the privileged port)
Verify that dnsmasq is working
- dig @localhost localhost +short should return 127.0.0.1
Stop your local gdk postgresql
- gdk stop postgresql
Run postgresql directly so that it opens a tcp port.
- cd your-gdk-dir/postgresql/data && pg_ctl start -D .
Add the following production entry to your database.yml (this problem is only reproducible with RAILS_ENV=production)

production:
  main:
    adapter: postgresql
    encoding: unicode
    database: gitlabhq_development
    host: localhost
    port: 5432
    pool: 10
    prepared_statements: false
    variables:
      statement_timeout: 120s
    load_balancing:
      discover:
        nameserver: localhost
        port: 53
        record: localhost
        record_type: A
        interval: 60
        disconnect_timeout: 120

It's difficult to inject a failure in a local dns nameserver, so my solution when testing this was to insert a breakpoint and disable or re-enable my nameserver between loop iterations.

Add the following line:

diff --git a/lib/gitlab/database/load_balancing/service_discovery.rb b/lib/gitlab/database/load_balancing/service_discovery.rb
index d22134379da..e27e81f4a71 100644
--- a/lib/gitlab/database/load_balancing/service_discovery.rb
+++ b/lib/gitlab/database/load_balancing/service_discovery.rb
@@ -92,6 +92,7 @@ def perform_service_discovery
             Gitlab::AppLogger.error(
               "Service discovery encountered an error: #{error.message}"
             )
+            binding.pry
 
             # Slightly randomize the retry delay so that, in the case of a total
             # dns outage, all starting services do not pressure the dns server at the same time.

In one terminal, disable dnsmasq: sudo brew services stop dnsmasq

In another, run rails: env RAILS_ENV=production ENABLE_BOOTSNAP=true bundle exec puma

Wait for the breakpoint to be reached in the rescue block.

If you would like another failure to happen, simply continue in pry.

If you would like to recover, sudo brew services start dnsmasq before continuing.

Note that if you make service discovery fail 3 times here, puma will proceed to start worker threads, and they will all start running service discovery. You will then hit the pry breakpoint in multiple threads at once, and your debug session will be a mess and difficult to use.

Does this MR meet the acceptance criteria?

Conformity

I have included changelog trailers, or none are needed. (Does this MR need a changelog?)
I have added/updated documentation, or it's not needed. (Is documentation required?)
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?)
I have self-reviewed this MR per code review guidelines.
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
I have followed the style guides.
This change is backwards compatible across updates, or this does not apply.

Availability and Testing

I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)
I have tested this MR in all supported browsers, or it's not needed.
I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.

Edited Aug 24, 2021 by Simon Tomlinson

Retry service discovery when it fails