Geo: Monitoring
To build a robust monitoring solution for Geo, we have to answer the following questions, each with one or more solutions:
- How to make sure replication is working?
  - Prometheus metrics, the API status endpoint, and the admin panel track these:
    - Database replication
    - REST API based replication
    - Git repositories
  - How to detect delays in database replication? Tracked in Prometheus, the API status endpoint, and the admin panel.
  - How to detect delays in Sidekiq-dependent replications (Git/wiki/SSH keys)?
  - How to track failures? https://gitlab.com/gitlab-org/gitlab-ee/issues/2968
  - How to track reschedules? https://gitlab.com/gitlab-org/gitlab-ee/issues/3119
  - Everything is async; how to make sure we are not losing important data? https://gitlab.com/gitlab-org/gitlab-ce/issues/39949 https://gitlab.com/gitlab-org/gitlab-ce/issues/40228
- How to detect if the primary can communicate with the secondary?
  - Via HTTP / HTTPS (check certificates): rake geo:gitlab:check
- How to detect if the secondary can communicate with the primary?
  - Via HTTP / HTTPS (check certificates): rake geo:gitlab:check
  - [-] Via SSH (deprecated, to be removed)
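The replication questions above mostly come down to comparing a few numbers against thresholds. Here is a minimal sketch of that evaluation in Python, assuming a status snapshot has already been fetched from one of the sources listed (Prometheus or the API status endpoint). The field names, the `GeoStatusSnapshot` type, and the threshold values are illustrative assumptions, not a confirmed schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GeoStatusSnapshot:
    """Hypothetical snapshot of a secondary's replication state."""
    db_replication_lag_seconds: Optional[float]  # None: lag could not be measured
    repositories_total: int
    repositories_synced: int
    repositories_failed: int


def find_problems(status: GeoStatusSnapshot,
                  max_db_lag_seconds: float = 60.0,
                  min_synced_ratio: float = 0.99) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems: List[str] = []

    # Database replication: an unknown lag is treated as a problem in itself.
    if status.db_replication_lag_seconds is None:
        problems.append("database replication lag is unknown")
    elif status.db_replication_lag_seconds > max_db_lag_seconds:
        problems.append(
            f"database replication lag is {status.db_replication_lag_seconds:.0f}s"
        )

    # Failures: any failed repository sync is reported.
    if status.repositories_failed > 0:
        problems.append(f"{status.repositories_failed} repositories failed to sync")

    # Delays in async replication: too small a share of repositories is synced.
    if status.repositories_total > 0:
        ratio = status.repositories_synced / status.repositories_total
        if ratio < min_synced_ratio:
            problems.append(f"only {ratio:.0%} of repositories are synced")

    return problems
```

Treating an unmeasurable lag as a problem, rather than as zero, is a deliberate choice in this sketch: a broken measurement should not look like a healthy secondary.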
Proposal
Some things need to be checked only once, during setup, and should not change at runtime. Other state and failures can only arise at runtime.
The first set of checks should be part of either a rake task or a configuration check page in the Admin screen.
For the second set, we should extend the Health Check API endpoint, or add a similar endpoint with more verbose details.
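One way to picture the split between setup-time and runtime checks is a single registry of checks, each tagged with the phase it belongs to. The following Python sketch assumes that design. `Phase`, `Check`, and `run_checks` are hypothetical names used only for illustration, and the check bodies are left to the caller.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Phase(Enum):
    SETUP = "setup"      # verified once, e.g. by a rake task or an admin page
    RUNTIME = "runtime"  # polled repeatedly, e.g. by a health check endpoint


@dataclass
class Check:
    name: str
    phase: Phase
    run: Callable[[], bool]  # returns True when the check passes


def run_checks(checks: List[Check], phase: Phase) -> Dict[str, bool]:
    """Run only the checks registered for the given phase."""
    return {check.name: check.run() for check in checks if check.phase is phase}
```

A setup tool would call `run_checks(checks, Phase.SETUP)`, while a health endpoint would call it with `Phase.RUNTIME` on every request or on a schedule.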
Some useful resources on Sidekiq monitoring:
- https://github.com/mperham/sidekiq/wiki/Monitoring#monitoring-queue-backlog
- https://github.com/mperham/sidekiq/wiki/API
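The Sidekiq pages linked above describe queue latency as the age of the oldest job still waiting in a queue. The sketch below expresses that calculation in Python on plain timestamps. It does not talk to Sidekiq or Redis. The function names and the threshold are hypothetical, and the enqueue times are assumed to be available as epoch seconds.

```python
import time
from typing import Optional, Sequence


def queue_latency(enqueued_at: Sequence[float], now: Optional[float] = None) -> float:
    """Seconds the oldest pending job has waited; 0.0 for an empty queue."""
    if not enqueued_at:
        return 0.0
    current = time.time() if now is None else now
    # Clamp at zero so clock skew never produces a negative latency.
    return max(0.0, current - min(enqueued_at))


def is_backlogged(enqueued_at: Sequence[float],
                  max_latency_seconds: float,
                  now: Optional[float] = None) -> bool:
    """True when the oldest job has waited longer than the allowed latency."""
    return queue_latency(enqueued_at, now) > max_latency_seconds
```

Latency is often a more useful signal than queue size for detecting replication delays, because a long queue of fast jobs may drain quickly while a short queue of stuck jobs does not.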
Related issues:
- #1611 (closed)
- #1255 (closed)
- #1664 (closed)
- #1751 (closed)
- gitlab-org/gitlab-ce#28080
Edited by Nick Thomas