NoMethodError: undefined method `load_balancer' for nil:NilClass
https://sentry.gitlab.net/gitlab/gitlabcom/issues/1075504/
Around 100 out of 1200 requests to this API node failed with 500
NoMethodError: undefined method `load_balancer' for nil:NilClass
gitlab/database/load_balancing/sticking.rb:83:in `load_balancer'
LoadBalancing.proxy.load_balancer
gitlab/database/load_balancing/sticking.rb:36:in `all_caught_up?'
load_balancer.all_caught_up?(location).tap do |caught_up|
gitlab/database/load_balancing/sticking.rb:44:in `unstick_or_continue_sticking'
Session.current.use_primary! unless all_caught_up?(namespace, id)
gitlab/database/load_balancing/rack_middleware.rb:23:in `stick_or_unstick'
Sticking.unstick_or_continue_sticking(namespace, id)
ee/api/helpers.rb:34:in `block in current_user'
.stick_or_unstick(env, :user, user.id)
...
(140 additional frame(s) were not displayed)
NoMethodError: undefined method `load_balancer' for nil:NilClass
gitlab/database/load_balancing/sticking.rb:83:in `load_balancer'
LoadBalancing.proxy.load_balancer
gitlab/database/load_balancing/sticking.rb:36:in `all_caught_up?'
load_balancer.all_caught_up?(location).tap do |caught_up|
gitlab/database/load_balancing/sticking.rb:44:in `unstick_or_continue_sticking'
Session.current.use_primary! unless all_caught_up?(namespace, id)
gitlab/database/load_balancing/rack_middleware.rb:23:in `stick_or_unstick'
Sticking.unstick_or_continue_sticking(namespace, id)
ee/api/helpers.rb:34:in `block in current_user'
.stick_or_unstick(env, :user, user.id)
...
(117 additional frame(s) were not displayed)
Workaround
Restart the affected node
Timeline
https://gitlab.slack.com/archives/C101F3796/p1574221148368800 (internal Slack thread):
Seeing https://sentry.gitlab.net/gitlab/gitlabcom/?query=load_balancer on various API endpoints - has there been any change to the Rails LB config?
undefined method `load_balancer' for nil:NilClass
cmiskell 44 minutes ago
A release just hit gprd-cny about 45 minutes ago
Thong Kuah 44 minutes ago
The proxy seems to be configured only in an initializer (https://gitlab.com/gitlab-org/gitlab/blob/b83380f3ea70de733d696f053ecc4f3c4f7da594/config/initializers/load_balancing.rb#L14). Some API nodes didn't start right? How can we tell, @cmiskell?
cmiskell 42 minutes ago
Can Sentry summarize by node?
cmiskell 41 minutes ago
Although I'd note that just looking at production.log on the two api-cny nodes suggests api-cny-02 is fine, and api-cny-01 is having a bad day
cmiskell 41 minutes ago
Or an intermittently bad day
Thong Kuah 39 minutes ago
No, but on Kibana all 500s for my username point to api-cny-01-sv-gprd.
Thong Kuah 38 minutes ago
Did api-cny-01-sv-gprd fail to initialize somehow? It should be all over the logs
cmiskell 37 minutes ago
It's not all dead, just partly dead. It's serving a lot of requests ok, but some are failing with that stack trace
cmiskell 37 minutes ago
All the workers there are new, started ~1h10min ago
Thong Kuah 37 minutes ago
Restarting the Rails process should fix it (if the initializer runs successfully this time). If not...
cmiskell 36 minutes ago
I'd like to poke at it for at least another couple of minutes, see if we can diagnose a bit further.
Thong Kuah 36 minutes ago
Fair
cmiskell 34 minutes ago
Could this have left one unicorn process with a broken set of connectivity to the DB?
Right at boot time
PG::ConnectionBad (ERROR: pgbouncer cannot connect to server
):
lib/feature.rb:15:in `feature_names'
lib/feature.rb:40:in `block in persisted_names'
Thong Kuah 31 minutes ago
Sounds about right. Kibana says about 100/2000 requests died with 500.
cmiskell 31 minutes ago
30 workers, yeah, that's pretty close.
Thong Kuah 31 minutes ago
That initializer also checks for ActiveRecord::Base.connected?.
cmiskell 31 minutes ago
I can't track it down to a specific worker, so I'm going to drain + HUP that node.
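The failure mode diagnosed in the thread above can be sketched as follows. This is an illustrative reconstruction, not the real GitLab code: all class and method names here are simplified stand-ins, and the `db_connected:` flag stands in for the `ActiveRecord::Base.connected?` check in the real initializer.

```ruby
# Illustrative sketch of the suspected failure mode: the initializer bails
# out when the database is unreachable at boot, LoadBalancing.proxy stays
# nil for the life of that worker, and every sticky request then raises
# NoMethodError. All names are simplified stand-ins, not the real GitLab code.
module LoadBalancing
  class Proxy
    def load_balancer
      :load_balancer
    end
  end

  class << self
    attr_accessor :proxy
  end

  # Stand-in for config/initializers/load_balancing.rb
  def self.configure!(db_connected:)
    return unless db_connected # silently skipped at boot => proxy stays nil
    self.proxy = Proxy.new
  end
end

LoadBalancing.configure!(db_connected: false) # e.g. pgbouncer error at boot

begin
  LoadBalancing.proxy.load_balancer
rescue NoMethodError => e
  puts "request failed: #{e.class}" # the 500s seen on api-cny-01
end
```

Because the initializer runs once per worker boot, a worker that hits the PG::ConnectionBad window stays broken until restarted, which matches ~100 failures out of ~2000 requests across 30 workers.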
Areas of investigation
- If we detect that `LoadBalancing.proxy` is nil, we should log an exception.
- If we're in Sidekiq, we should stop using the load balancing feature, but not fail hard.
- (If options 1 & 2 don't work) If we're in Unicorn, we can either try to re-initialize load balancing, fail Unicorn hard, or skip load balancing. The latter might have bad effects on the primary, so I'm inclined to try to recover.
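A minimal sketch of how options 1 and 2 could combine. Module and method names mirror the stack trace for readability, but the logger and the fallback return values are assumptions for illustration only, not a proposed patch:

```ruby
# Hedged sketch of options 1 and 2: log loudly when LoadBalancing.proxy is
# nil instead of raising NoMethodError, and let callers degrade gracefully.
# The logger and fallback behaviour are illustrative assumptions.
require 'logger'

module LoadBalancing
  class << self
    attr_accessor :proxy # set by the initializer; nil if it never ran
  end
end

module Sticking
  LOGGER = Logger.new($stdout)

  # Option 1: detect the nil proxy and log an error instead of letting
  # NoMethodError bubble up as a 500.
  def self.load_balancer
    proxy = LoadBalancing.proxy
    if proxy.nil?
      LOGGER.error('LoadBalancing.proxy is nil; was the initializer skipped?')
      return nil
    end
    proxy.load_balancer
  end

  # Option 2: when load balancing is unavailable, skip the feature rather
  # than failing hard (assumption here: treat the session as caught up).
  def self.all_caught_up?(location)
    lb = load_balancer
    return true if lb.nil?
    lb.all_caught_up?(location)
  end
end
```

With `LoadBalancing.proxy` unset, `Sticking.all_caught_up?` now logs and returns true instead of raising, so the request proceeds on a replica; whether "assume caught up" is the right fallback is exactly the primary-load trade-off option 3 raises.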
Edited by Craig Gomes