Consider maximum retries for ES ConnectionFailed issues
Summary
We had a customer ticket (internal) from a customer whose Elasticsearch integration was enabled while the Elasticsearch node was intentionally offline. This seemingly caused no issues until an upgrade to 16.0.4, which appears to have kicked off the bulk indexing process again. In the GitLabSOS we were seeing repeated Faraday::ConnectionFailed errors:
```json
{"severity":"ERROR","time":"2023-06-12T13:48:39.263Z","meta.caller_id":"ElasticIndexInitialBulkCronWorker","correlation_id":"f06941d8254d36650210e4b9beacba92","meta.root_caller_id":"Cronjob","meta.feature_category":"global_search","meta.client_id":"ip/","message":"bulk_exception","error_class":"Faraday::ConnectionFailed","error_message":"Failed to open TCP connection to <ip_address>:9200 (execution expired)"}
```

```shell
❯ rg 'Faraday::ConnectionFailed' ./elasticsearch.log | wc -l
1268
```
This caused the Sidekiq queue to grow to 2.4 million items, which stalled CI/CD pipelines, merge requests, and background migrations. The solution was to disable the integration and then forcefully delete the ElasticIndexInitialBulkCronWorker and ElasticIndexBulkCronWorker jobs.
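For reference, here is a minimal sketch of how that kind of cleanup could be performed from the Rails console. It assumes the standard Sidekiq Ruby API (`Sidekiq::Queue`, `Sidekiq::ScheduledSet`, `Sidekiq::RetrySet`); the exact queues holding the jobs can vary per instance:

```ruby
# Run inside the GitLab Rails console (sudo gitlab-rails console).
# Sketch only: uses the standard Sidekiq API; adjust to the instance.
require 'sidekiq/api'

TARGET_WORKERS = %w[ElasticIndexInitialBulkCronWorker ElasticIndexBulkCronWorker].freeze

# Drop matching jobs from every enqueued queue.
Sidekiq::Queue.all.each do |queue|
  queue.each { |job| job.delete if TARGET_WORKERS.include?(job.klass) }
end

# Also drop any scheduled or retrying instances of the same workers.
[Sidekiq::ScheduledSet.new, Sidekiq::RetrySet.new].each do |set|
  set.each { |job| job.delete if TARGET_WORKERS.include?(job.klass) }
end
```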
We've had some similar issues and discussions:
Should we consider implementing a maximum retry limit for ConnectionFailed errors before kicking off bulk indexing? As it stands, an unreachable Elasticsearch node can unnecessarily cause performance issues on the instance, and this can happen to any user who enables the feature to trial it and then forgets to turn it off. From a UI standpoint, we could alert the admin once there have been X failed attempts to connect to the Elasticsearch node, and then remove the unprocessable jobs.
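A rough sketch of what such a cap could look like, assuming a simple consecutive-failure counter (a circuit-breaker style check). The class and method names here (ElasticCircuitBreaker, BulkIndexCronWorker, bulk_index_pending_documents) are hypothetical, not GitLab's actual implementation:

```ruby
# Hypothetical sketch: count consecutive Faraday::ConnectionFailed errors
# and skip scheduling further bulk indexing work (and alert the admin)
# once a threshold is hit, instead of letting Sidekiq fill up.
require 'faraday'

class ElasticCircuitBreaker
  MAX_CONSECUTIVE_FAILURES = 10

  def initialize
    @failures = 0 # in practice this would need a shared store, e.g. Redis
  end

  def allow_request?
    @failures < MAX_CONSECUTIVE_FAILURES
  end

  def record_failure!
    @failures += 1
    alert_admin if @failures == MAX_CONSECUTIVE_FAILURES
  end

  def record_success!
    @failures = 0
  end

  private

  def alert_admin
    # Surface a persistent admin notification, e.g. "Elasticsearch has been
    # unreachable for N consecutive attempts; bulk indexing is paused."
  end
end

class BulkIndexCronWorker
  def initialize(breaker, client)
    @breaker = breaker
    @client = client
  end

  def perform
    # Skip the run entirely while the breaker is open, so unprocessable
    # jobs never pile up in the Sidekiq queue.
    return unless @breaker.allow_request?

    @client.bulk_index_pending_documents
    @breaker.record_success!
  rescue Faraday::ConnectionFailed
    @breaker.record_failure!
  end
end
```

Keeping the counter in a shared store and resetting it on the first successful request would let indexing resume automatically once the node comes back, while the admin alert makes the paused state visible rather than silent.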
What is the current bug behavior?
Upgrading to GitLab 16 with ES enabled but unreachable results in a rapidly growing Sidekiq queue.
What is the expected correct behavior?
The Sidekiq queue should not fill with ES jobs when the ES node is not reachable.
Relevant logs and/or screenshots
More information is available in the ticket (internal).