NoMethodError: undefined method `load_balancer' for nil:NilClass
https://sentry.gitlab.net/gitlab/gitlabcom/issues/1075504/
Around 100 out of 1200 requests to this API node failed with 500
NoMethodError: undefined method `load_balancer' for nil:NilClass
gitlab/database/load_balancing/sticking.rb:83:in `load_balancer'
LoadBalancing.proxy.load_balancer
gitlab/database/load_balancing/sticking.rb:36:in `all_caught_up?'
load_balancer.all_caught_up?(location).tap do |caught_up|
gitlab/database/load_balancing/sticking.rb:44:in `unstick_or_continue_sticking'
Session.current.use_primary! unless all_caught_up?(namespace, id)
gitlab/database/load_balancing/rack_middleware.rb:23:in `stick_or_unstick'
Sticking.unstick_or_continue_sticking(namespace, id)
ee/api/helpers.rb:34:in `block in current_user'
.stick_or_unstick(env, :user, user.id)
...
(140 additional frame(s) were not displayed)
NoMethodError: undefined method `load_balancer' for nil:NilClass
gitlab/database/load_balancing/sticking.rb:83:in `load_balancer'
LoadBalancing.proxy.load_balancer
gitlab/database/load_balancing/sticking.rb:36:in `all_caught_up?'
load_balancer.all_caught_up?(location).tap do |caught_up|
gitlab/database/load_balancing/sticking.rb:44:in `unstick_or_continue_sticking'
Session.current.use_primary! unless all_caught_up?(namespace, id)
gitlab/database/load_balancing/rack_middleware.rb:23:in `stick_or_unstick'
Sticking.unstick_or_continue_sticking(namespace, id)
ee/api/helpers.rb:34:in `block in current_user'
.stick_or_unstick(env, :user, user.id)
...
(117 additional frame(s) were not displayed)
Workaround
Restart the affected node
Timeline
https://gitlab.slack.com/archives/C101F3796/p1574221148368800 (internal Slack thread):
Seeing https://sentry.gitlab.net/gitlab/gitlabcom/?query=load_balancer on various API endpoints - has there been any change to the Rails LB config?
undefined method `load_balancer' for nil:NilClass
cmiskell 44 minutes ago
A release just hit gprd-cny about 45 minutes ago
Thong Kuah 44 minutes ago
The proxy seems to be configured only in an initializer (https://gitlab.com/gitlab-org/gitlab/blob/b83380f3ea70de733d696f053ecc4f3c4f7da594/config/initializers/load_balancing.rb#L14). Some API nodes didn't start right? How can we tell, @cmiskell?
cmiskell 42 minutes ago
Can Sentry summarize by node?
cmiskell 41 minutes ago
Although I'd note that just looking at production.log on the two api-cny nodes suggests api-cny-02 is fine, and api-cny-01 is having a bad day
cmiskell 41 minutes ago
Or an intermittently bad day
Thong Kuah 39 minutes ago
No, but on Kibana all 500s for my username point to api-cny-01-sv-gprd.
Thong Kuah 38 minutes ago
Did api-cny-01-sv-gprd fail to initialize somehow? It should be all over the logs
cmiskell 37 minutes ago
It's not all dead, just partly dead. It's serving a lot of requests ok, but some are failing with that stack trace
cmiskell 37 minutes ago
All the workers there are new, started ~1h10min ago
Thong Kuah 37 minutes ago
Restarting the Rails process should fix it (if the initializer runs successfully this time). If not...
cmiskell 36 minutes ago
I'd like to poke at it for at least another couple of minutes, see if we can diagnose a bit further.
Thong Kuah 36 minutes ago
Fair
cmiskell 34 minutes ago
Could this have left one unicorn process with a broken set of connectivity to the DB?
Right at boot time
PG::ConnectionBad (ERROR: pgbouncer cannot connect to server
):
lib/feature.rb:15:in `feature_names'
lib/feature.rb:40:in `block in persisted_names'
Thong Kuah 31 minutes ago
Sounds about right. Kibana says about 100/2000 requests died with 500.
cmiskell 31 minutes ago
30 workers, yeah, that's pretty close.
Thong Kuah 31 minutes ago
That initializer also checks for ActiveRecord::Base.connected?.
cmiskell 31 minutes ago
I can't track it down to a specific worker, so I'm going to drain + HUP that node.
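The failure mode diagnosed in the thread above can be sketched as follows. This is an illustrative reconstruction, not the real GitLab code: all class and method names here are simplified stand-ins, and the `db_connected:` flag stands in for the `ActiveRecord::Base.connected?` check in the real initializer.

```ruby
# Illustrative sketch of the suspected failure mode: the initializer bails
# out when the database is unreachable at boot, LoadBalancing.proxy stays
# nil for the life of that worker, and every sticky request then raises
# NoMethodError. All names are simplified stand-ins, not the real GitLab code.
module LoadBalancing
  class Proxy
    def load_balancer
      :load_balancer
    end
  end

  class << self
    attr_accessor :proxy
  end

  # Stand-in for config/initializers/load_balancing.rb
  def self.configure!(db_connected:)
    return unless db_connected # silently skipped at boot => proxy stays nil
    self.proxy = Proxy.new
  end
end

LoadBalancing.configure!(db_connected: false) # e.g. pgbouncer error at boot

begin
  LoadBalancing.proxy.load_balancer
rescue NoMethodError => e
  puts "request failed: #{e.class}" # the 500s seen on api-cny-01
end
```

Because the initializer runs once per worker boot, a worker that hits the PG::ConnectionBad window stays broken until restarted, which matches ~100 failures out of ~2000 requests across 30 workers.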
Areas of investigation
- If we detect that `LoadBalancing.proxy` is nil, we should log an exception.
- If we're in Sidekiq, we should stop using the load balancing feature, but not fail hard.
- (If options 1 & 2 don't work) If we're in Unicorn, we can either try to re-initialize load balancing, fail Unicorn hard, or skip load balancing. The latter might have bad effects on the primary, so I'm inclined to try to recover.
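A minimal sketch of how options 1 and 2 could combine. Module and method names mirror the stack trace for readability, but the logger and the fallback return values are assumptions for illustration only, not a proposed patch:

```ruby
# Hedged sketch of options 1 and 2: log loudly when LoadBalancing.proxy is
# nil instead of raising NoMethodError, and let callers degrade gracefully.
# The logger and fallback behaviour are illustrative assumptions.
require 'logger'

module LoadBalancing
  class << self
    attr_accessor :proxy # set by the initializer; nil if it never ran
  end
end

module Sticking
  LOGGER = Logger.new($stdout)

  # Option 1: detect the nil proxy and log an error instead of letting
  # NoMethodError bubble up as a 500.
  def self.load_balancer
    proxy = LoadBalancing.proxy
    if proxy.nil?
      LOGGER.error('LoadBalancing.proxy is nil; was the initializer skipped?')
      return nil
    end
    proxy.load_balancer
  end

  # Option 2: when load balancing is unavailable, skip the feature rather
  # than failing hard (assumption here: treat the session as caught up).
  def self.all_caught_up?(location)
    lb = load_balancer
    return true if lb.nil?
    lb.all_caught_up?(location)
  end
end
```

With `LoadBalancing.proxy` unset, `Sticking.all_caught_up?` now logs and returns true instead of raising, so the request proceeds on a replica; whether "assume caught up" is the right fallback is exactly the primary-load trade-off option 3 raises.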
Edited by Craig Gomes