chore: tweak liveness probe
What does this merge request do and why?
This MR tweaks the liveness probe of Cloud Run similar to chore: enable startup probe for production (!904 - merged).
We are observing https://gitlab.com/groups/gitlab-org/-/epics/15402+ that occasionally liveness probes fail and trigger multiple-instance reboots, which affects the error rate.
For example, this brief incident occurred at 2024-10-03 07:00.
https://log.gprd.gitlab.net/app/r/s/dZypi
We also see the healthcheck request failures in https://dashboards.gitlab.net/d/runway-service/runway3a-runway-service-metrics?orgId=1.
FYI, the current settings is:
livenessProbe:
timeoutSeconds: 1
periodSeconds: 10
failureThreshold: 3
httpGet:
path: /monitoring/healthz
port: 8080
How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.
Merge request checklist
-
Tests added for new functionality. If not, please raise an issue to follow up. -
Documentation added/updated, if needed.
Edited by Shinya Maeda