Does recovery.conf in GSTG need the recovery_target_timeline setting?
In #358 (comment 70387124), we fixed a replication issue by setting the following in recovery.conf:
recovery_target_timeline = 'latest'
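For context, a minimal recovery.conf on a WAL-E-restoring standby would look roughly like this. This is a sketch only: the host, replication user, and envdir path are assumptions, not values copied from the GSTG hosts (the /opt/wal-e location is inferred from the traceback further down).

```
# recovery.conf -- illustrative sketch; connection details and paths are assumptions
standby_mode = 'on'
primary_conninfo = 'host=postgres-02-db-gstg port=5432 user=gitlab_replicator'   # hypothetical host/user
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "%f" "%p"'   # paths assumed
recovery_target_timeline = 'latest'
```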
Couple of questions:
- Does this need to be set in our GSTG environment? @ahanselka
- Does this setting need to be in Omnibus/Geo? @ibaum, @abrandl
Background:
@dbalexandre and I noticed that postgres-01 on GSTG was not replicating properly, and it looks like WAL-E was trying to fetch a WAL file that did not exist:
2018-05-09_19:09:12.30035 postgres-01-db-gstg postgresql: wal_e.operator.backup INFO MSG: begin wal restore
2018-05-09_19:09:12.30148 postgres-01-db-gstg postgresql: STRUCTURED: time=2018-05-09T19:09:12.299546-00 pid=16610 action=wal-fetch key=s3://gitlab-dbstg-backups/postgres02/wal_005/0000000600002D8100000025.lzo prefix=postgres02/ seg=0000000600002D8100000025 state=begin
2018-05-09_19:09:12.58194 postgres-01-db-gstg postgresql: gpg: decrypt_message failed: Unknown system error
2018-05-09_19:09:12.58326 postgres-01-db-gstg postgresql: lzop: <stdin>: not a lzop file
2018-05-09_19:09:12.58456 postgres-01-db-gstg postgresql: wal_e.blobstore.s3.s3_util INFO MSG: could no longer locate object while performing wal restore
2018-05-09_19:09:12.58510 postgres-01-db-gstg postgresql: DETAIL: The absolute URI that could not be located is s3://gitlab-dbstg-backups/postgres02/wal_005/0000000600002D8100000025.lzo.
2018-05-09_19:09:12.58558 postgres-01-db-gstg postgresql: HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
2018-05-09_19:09:12.58605 postgres-01-db-gstg postgresql: STRUCTURED: time=2018-05-09T19:09:12.584213-00 pid=16610
2018-05-09_19:09:12.58824 postgres-01-db-gstg postgresql: wal_e.operator.backup INFO MSG: complete wal restore
2018-05-09_19:09:12.58877 postgres-01-db-gstg postgresql: STRUCTURED: time=2018-05-09T19:09:12.588003-00 pid=16610 action=wal-fetch key=s3://gitlab-dbstg-backups/postgres02/wal_005/0000000600002D8100000025.lzo prefix=postgres02/ seg=0000000600002D8100000025 state=complete
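To narrow down whether the segment is genuinely missing from the archive or the fetch pipeline itself is broken, it can help to look for the object and fetch it by hand, outside of Postgres. This is only a sketch: the bucket and key come from the log above, but the aws CLI usage and the envdir/wal-e paths are assumptions about how these hosts are set up.

```sh
# Does the object (or anything else on that timeline) exist in the bucket at all?
aws s3 ls s3://gitlab-dbstg-backups/postgres02/wal_005/ | grep 0000000600002D81

# Reproduce the fetch by hand with the same WAL-E command Postgres runs
# (envdir and binary paths are assumptions based on a typical WAL-E install):
/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e \
  wal-fetch 0000000600002D8100000025 /tmp/0000000600002D8100000025
```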
This was similar to #358 (closed), so we added recovery_target_timeline = 'latest' to recovery.conf. After restarting, things got worse:
2018-05-10_19:50:33.47904 postgres-01-db-gstg postgresql: gpg: Sorry, we are in batchmode - can't get input
2018-05-10_19:50:33.47936 postgres-01-db-gstg postgresql: lzop: <stdin>: not a lzop file
2018-05-10_19:50:33.57875 postgres-01-db-gstg postgresql: wal_e.blobstore.s3.s3_util WARNING MSG: retrying WAL file fetch from unexpected exception
2018-05-10_19:50:33.57885 postgres-01-db-gstg postgresql: DETAIL: The exception type is <class 'wal_e.exception.UserCritical'> and its value is CRITICAL: MSG: pipeline process did not exit gracefully
2018-05-10_19:50:33.57886 postgres-01-db-gstg postgresql: DETAIL: "gpg2 -d -q --batch --pinentry-mode loopback" had terminated with the exit status 2.
2018-05-10_19:50:33.57888 postgres-01-db-gstg postgresql: STRUCTURED: time=2018-05-10T19:50:33.578388-00 pid=27908 and its traceback is File "/opt/wal-e/lib/python3.5/site-packages/wal_e/retries.py", line 62, in shim
2018-05-10_19:50:33.57889 postgres-01-db-gstg postgresql: return f(*args, **kwargs)
2018-05-10_19:50:33.57890 postgres-01-db-gstg postgresql: File "/opt/wal-e/lib/python3.5/site-packages/wal_e/blobstore/s3/s3_util.py", line 139, in download
2018-05-10_19:50:33.57892 postgres-01-db-gstg postgresql: raise
2018-05-10_19:50:33.57893 postgres-01-db-gstg postgresql: File "/opt/wal-e/lib/python3.5/site-packages/wal_e/pipeline.py", line 115, in __exit__
2018-05-10_19:50:33.57894 postgres-01-db-gstg postgresql: command.finish()
2018-05-10_19:50:33.57895 postgres-01-db-gstg postgresql: File "/opt/wal-e/lib/python3.5/site-packages/wal_e/pipeline.py", line 204, in finish
2018-05-10_19:50:33.57896 postgres-01-db-gstg postgresql: .format(" ".join(self._command), retcode))
2018-05-10_19:50:33.57897 postgres-01-db-gstg postgresql: There have been 3879 attempts to fetch wal file s3://gitlab-dbstg-backups/postgres02/wal_005/00000007.history.lzo so far.
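The retries are all for 00000007.history, which is exactly the timeline-history probing that recovery_target_timeline = 'latest' triggers, and the earlier "Sorry, we are in batchmode - can't get input" error suggests the gpg decryption step, rather than S3, is what is failing now. Two hedged checks that might help confirm that; the data directory path below is an assumption (a common Omnibus layout), and the gpg/lzop pipeline simply mirrors the command shown in the log:

```sh
# Which timeline is this standby actually on? (data directory path is an assumption)
sudo -u gitlab-psql pg_controldata /var/opt/gitlab/postgresql/data | grep -i timeline

# Pull the history object and run the same decrypt/decompress pipeline WAL-E uses,
# to see whether gpg can decrypt non-interactively on this host:
aws s3 cp s3://gitlab-dbstg-backups/postgres02/wal_005/00000007.history.lzo - \
  | gpg2 -d -q --batch --pinentry-mode loopback \
  | lzop -dc > /tmp/00000007.history
```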