External PostgreSQL - Segmentation fault citing LooseForeignKeys::CleanupWorker causes complete database restart
Summary
We have had a number of customers reporting 500s and other issues that were traced to PostgreSQL restarting.
Errors in Rails include:
"FATAL: the database system is in recovery mode\nFATAL: the database system is in recovery mode\n"
"PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n"
"PG::ConnectionBad: PQconsumeInput() SSL SYSCALL error: EOF detected\n"
"PG::UnableToSend: no connection to the server\n"
"PG::UnableToSend: SSL SYSCALL error: EOF detected\n"
"server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n"
"SSL SYSCALL error: Connection reset by peer\nFATAL: the database system is in recovery mode\n"
"SSL SYSCALL error: EOF detected\n"
"SSL SYSCALL error: Success\nFATAL: the database system is in recovery mode\n"
Errors in the PostgreSQL log:
FATAL: the database system is in recovery mode
LOG: server process (PID 10611) was terminated by signal 11: Segmentation fault
DETAIL: Failed process was running: /*application:sidekiq,correlation_id:<ID>,jid:<JID>,endpoint_id:LooseForeignKeys::CleanupWorker,db_config_name:main*/ DELETE FROM "ci_pipelines" WHERE ("ci_pipelines"."id") IN (SELECT "ci_pipelines"."id" FROM "ci_pipelines" WHERE "ci_pipelines"."merge_request_id" IN (12345, 12346) LIMIT 1000 FOR UPDATE SKIP LOCKED)
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at <TIME>
LOG: database system was not properly shut down; automatic recovery in progress
The pattern observed so far is that customers are using external PostgreSQL, and all examples to date have been AWS RDS.
🛠 Fix
The fix is to upgrade to the latest patch level of PostgreSQL: 12.7 or 13.3 at a minimum but, to avoid other known bugs, preferably the latest patch level of your current major release (12.x or 13.x).
Details of the bug/fix (provided on a ticket update - link for GitLab team members)
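To confirm which PostgreSQL version the instance is actually connected to (useful with external databases such as RDS), a minimal check from a Rails console, assuming you have access to one (for Omnibus installs, `gitlab-rails console`):

```ruby
# Print the PostgreSQL server version the main database connection is using.
# Run inside a Rails console (Omnibus installs: `gitlab-rails console`).
puts ActiveRecord::Base.connection.select_value('SHOW server_version')
```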
🛠 Workaround
`LooseForeignKeys::CleanupWorker` runs in Sidekiq as a cron job, by default every minute.
If this issue is occurring even on a patched version of PostgreSQL, then a temporary fix is to change the cron job so that it only runs once a week.
- Add to `gitlab.rb`:
  gitlab_rails['loose_foreign_keys_cleanup_worker_cron'] = "59 23 * * 7"
- Run `gitlab-ctl reconfigure`.
This doesn't entirely stop the issue, but it does limit the window in which the database is likely to crash to a single run per week.
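As a sanity check after reconfiguring, a minimal sketch for confirming the new schedule from a Rails console; the job name used here is an assumption, so adjust it if your instance registers the cron job under a different key:

```ruby
# Sketch: confirm the cleanup worker's cron schedule after `gitlab-ctl reconfigure`.
# The job key below is an assumption about how the instance registers the job
# with sidekiq-cron; adjust it if your instance uses a different name.
job = Sidekiq::Cron::Job.find('loose_foreign_keys_cleanup_worker')
puts job ? job.cron : 'no cron job registered under this name'
# Expected output after applying the workaround: "59 23 * * 7"
```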
Steps to reproduce
This bug seems to be triggered by `LooseForeignKeys::CleanupWorker`, which was released:
- For EE customers in %14.8 - !75511 (merged)
- For CE customers in %15.1 - !87983 (merged)
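For reproduction attempts, a minimal sketch of running the worker manually from a Rails console, assuming the worker class is present in the installed GitLab version (it went live in the releases listed above):

```ruby
# Sketch: run one iteration of the cleanup worker inline from a Rails console,
# which issues the same batched DELETE ... FOR UPDATE SKIP LOCKED queries shown
# in the PostgreSQL log above. Only do this against a database you can afford
# to crash, e.g. a test instance restored from production data.
LooseForeignKeys::CleanupWorker.new.perform
```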
Example Project
What is the current bug behavior?
Unplanned PostgreSQL restarts
What is the expected correct behavior?
No issues with PostgreSQL!
Relevant logs and/or screenshots
Output of checks
Possible fixes
Upgrade PostgreSQL to a patch level that includes this bug fix - 12.7 or higher, 13.3 or higher.
See #364763 (comment 979383307) for some GitLab/PostgreSQL version info.
We did not see this with customers running Omnibus GitLab. Looking at packaged versions, customers running PostgreSQL 12 on GitLab releases up to and including 14.10 would be running 12.7; PostgreSQL 12.10 shipped with GitLab 15.0.
We initially believed 12.10 was the required version (although evidence from GitLab.com and Omnibus deployments contradicted this), but later confirmed that 12.7 has the fix.
For customers who have upgraded to Omnibus PostgreSQL 13: GitLab 14.7 shipped with 13.3, and as the worker went live in 14.8, we think these sites are not at risk.
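As a quick way of applying the version guidance above, a minimal sketch (again assuming a Rails console on the affected instance) that compares the connected server's version against the 12.7 / 13.3 minimums stated in this issue:

```ruby
# Sketch: compare the connected PostgreSQL version against the minimum patch
# levels carrying the fix (12.7 for 12.x, 13.3 for 13.x), per this issue.
server  = ActiveRecord::Base.connection.select_value('SHOW server_version').split.first
minimum = server.start_with?('12.') ? '12.7' : '13.3'
verdict = Gem::Version.new(server) >= Gem::Version.new(minimum) ? 'includes the fix' : 'needs upgrading'
puts "PostgreSQL #{server}: #{verdict}"
```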