Investigate Prometheus connection failures in Usage Ping
We received 5773 usage ping records from version 13.2, of which 3197 have empty topology data (only duration_s, or at most nodes: []). Of those 3197 empty records, around 2400 failed due to Gitlab::PrometheusClient::ConnectionError. So, instance-wise, about 42% (2400/5773) of instances failed to get topology data because of this Prometheus connection failure.
Per the chart, 2409 instances failed with the Prometheus connection error on ALL 9 queries: https://app.periscopedata.com/app/gitlab/679200/Topology-Dashboard?widget=9334991&udv=0
Update Aug 24, 2020: More submissions have arrived over time. As of Aug 24, the latest figures are:
- 22746 instances failed with the Prometheus connection error on ALL 9 queries
- 29344 instances have empty nodes data
- 25542 instances have nodes data
So this concern deserves more attention: about 77.5% (22746/29344) of the instances with empty nodes data failed to get topology data because of this Prometheus connection failure.
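
The two percentages in this issue use different denominators (all reporting instances for 13.2, instances with empty nodes data for the Aug 24 update), so a small sketch of the arithmetic may help when comparing them. The helper name `share` is hypothetical, and the Aug 24 total is an assumption: it treats the "empty nodes" and "has nodes" groups as disjoint and as covering every reporting instance, which the dashboard does not confirm.

```python
def share(part: int, whole: int) -> float:
    """Return part / whole as a percentage, rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


# Figures as of Aug 24, 2020, copied from the list above.
connection_errors = 22746  # Prometheus connection error on all 9 queries
empty_nodes = 29344        # instances with empty nodes data
with_nodes = 25542         # instances with nodes data

# ASSUMPTION: the two groups are disjoint and together cover all instances.
total = empty_nodes + with_nodes

share_of_empty = share(connection_errors, empty_nodes)
share_of_total = share(connection_errors, total)

print(f"share of empty-nodes instances: {share_of_empty}%")
print(f"share of all instances (assumed total {total}): {share_of_total}%")
```

Under that assumption, the share measured against all instances is directly comparable to the 42% (2400/5773) figure reported for the initial 13.2 data, while the 77.5% figure describes only the instances that returned no nodes.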