Remove unsafe any_jobs check
What does this MR do and why?
Removes the any_jobs? check from the signal_and_wait function, which is used to restart sidekiq when memory usage goes above a specific limit.
The restart_sidekiq function calls the signal_and_wait function 3 times to kill the sidekiq process with increasingly forceful signals. The signal_and_wait function will only sleep if any_jobs? returns true. When there are no jobs, the code falls straight through to a kill -9 before the first signal has been handled.
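The fall-through described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the GitLab source: the helper names (seconds_waited, simulate_restart), the signal names, and the grace periods are hypothetical, and time is simulated rather than slept.

```ruby
# Simplified model of the wait loop; NOT the actual memory killer code.
# Returns how many (simulated) seconds we would sleep after sending a signal.
def seconds_waited(grace_seconds, check_jobs:, any_jobs:)
  waited = 0
  # Each iteration stands in for one sleep(1) in the real loop.
  waited += 1 while waited < grace_seconds && (!check_jobs || any_jobs.call)
  waited
end

# Records the simulated time at which each signal would be sent.
# Signal names and grace periods are illustrative assumptions.
def simulate_restart(check_jobs:, any_jobs:)
  now = 0
  [['SIGTSTP', 30], ['SIGTERM', 10], ['SIGKILL', 0]].map do |signal, grace|
    sent_at = now
    now += seconds_waited(grace, check_jobs: check_jobs, any_jobs: any_jobs)
    [signal, sent_at]
  end
end

idle = -> { false } # no jobs are running

with_check = simulate_restart(check_jobs: true, any_jobs: idle)
# => [["SIGTSTP", 0], ["SIGTERM", 0], ["SIGKILL", 0]]

without_check = simulate_restart(check_jobs: false, any_jobs: idle)
# => [["SIGTSTP", 0], ["SIGTERM", 30], ["SIGKILL", 40]]
```

With the job check in place and an idle process, all three signals are sent at the same instant. Without the check, each grace period is honoured before escalating.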
We came across this in our logs a couple of months back when trying to diagnose a problem where a resource group got stuck. We tracked it down to an issue with this job. In the logs we would see it being deduplicated as a duplicate of a job that we had no record of.
As we had a bad memory limit set (too low - 1G), we saw sidekiq restart every 15 minutes, and we would see a resource group get stuck nearly every week.
We concluded that sidekiq was in the process of accepting a job when it was killed by the memory killer.
This MR represents the fix we have monkey patched on a 14.9 self-hosted install, and it has been stable now for a couple of months.
Upon revisiting this issue, it might simply be the case that the && should become ||, but this MR is what we have running and could serve as a conversation starter.
How to set up and validate locally
We were unable to prove this locally (probably due to lack of knowledge), but as mentioned above, this is the fix we have working in a production environment. The exact patch for 14.9 is shared as an attachment, memory_killer.patch.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #381139 (closed)