Geo: Improve info rake task and create geo specific check task
https://gitlab.zendesk.com/agent/tickets/52324
This ticket exposed some new ideas and concerns we have to take care of when configuring / troubleshooting Geo (#76 (closed)).
The user had added an http_proxy environment variable to allow the machine to talk to the Internet (internal restrictions), in order to reach the Slack API.
While there is an issue with their proxy that prevents Geo nodes from communicating with each other, it's hard to reach that conclusion without enough data and investigation.
Proposal
Specifically for the proxy part, we should add a list of known proxies in the system (by watching for ENV variables that end with _proxy, case-insensitive).
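The proxy detection described above could be sketched roughly as follows. This is only an illustration: the method name proxy_env_variables is hypothetical and not part of the existing rake task.

```ruby
# Hypothetical helper (not an existing GitLab method): returns every
# environment variable whose name ends with "_proxy", compared
# case-insensitively (http_proxy, HTTPS_PROXY, no_proxy, ...).
def proxy_env_variables(env = ENV)
  env.to_h.select { |name, _value| name.match?(/_proxy\z/i) }
end

# Print whatever proxies are defined for the current process.
proxy_env_variables.each do |name, value|
  puts "#{name}: #{value}"
end
```

Accepting the environment as an argument keeps the helper easy to exercise with a plain Hash instead of the real ENV.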
After a discussion with the Build team, they also suggested we should display all ENV variables defined in the system, as this could help with unknown issues in the future.
Both proposals should also be backported to CE.
There is a third proposal, which is EE exclusive and will help figure out communication issues between Geo nodes once and for all:
From the primary node, we should try to Faraday.get('') an API endpoint in the secondary node's geo namespace (something that uses the same authentication code used for notifications).
From the secondary node, we should try to reach the primary node using any admin's key, pointing to the health check API endpoint or to the user's endpoint (this will pinpoint issues with either the connection or the token authentication).
Things we expect to detect:
- DNS issues
- Connection issues (timeout etc)
- SSL errors
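As a rough sketch of how a check could tell the failure categories listed above apart, the example below uses Ruby's standard Net::HTTP rather than Faraday, so that it runs without extra gems. The names classify_connection and check_node and the returned symbols are hypothetical, and the mapping from exception to category is an assumption, not an exhaustive list.

```ruby
require "net/http"
require "openssl"
require "socket"
require "uri"

# Hypothetical helper: runs the given block (which should perform the
# request) and maps common failures to the categories we want to report.
def classify_connection
  yield
  :ok
rescue SocketError
  :dns_issue          # failed name resolution surfaces as SocketError
rescue OpenSSL::SSL::SSLError
  :ssl_error          # certificate or handshake problems
rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ETIMEDOUT
  :timeout
rescue Errno::ECONNREFUSED, Errno::ECONNRESET, Errno::EHOSTUNREACH
  :connection_issue
end

# Hypothetical helper: issues a GET against the given URL of the other node
# and returns one of the symbols above. Authentication is left out here.
def check_node(url, timeout: 5)
  uri = URI.parse(url)
  classify_connection do
    Net::HTTP.start(uri.host, uri.port,
                    use_ssl: uri.scheme == "https",
                    open_timeout: timeout,
                    read_timeout: timeout) do |http|
      http.get(uri.request_uri)
    end
  end
end

# Example call, with a placeholder URL standing in for the real endpoint:
#   check_node("https://secondary.example.com/")
```

A real check would also need to send the appropriate token and inspect the response status to separate authentication failures from connection failures; that part is omitted in this sketch.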
Related issues
This info task will help implement a few things described in the Monitoring proposal: #727 (closed)