Increase reliability of node to service mappings
What does this MR do?
We submit topology information pulled from Prometheus with Usage Ping. Currently we target single-node topologies only, i.e. all Omnibus services will run on the same node.
In order to determine which GitLab service runs on which node/host, we currently inspect the instance
label returned with Prometheus queries. This can be tricky for the following reasons:
- Services running on the same node might all report their
instance
in different ways. For example, a service binding itself to0.0.0.0
to listen on all interfaces will be exported with that IP address. Moreover,localhost
services can sometimes be reported aslocalhost
, sometimes with thelo
IP addresses (127.0.0.1
and::1
for IPv4 and IPv6 respectively), yet they all run on the same host machine. - There is a problem with the current implementation where we only consider instances to be valid when returned from
node_exporter
metrics (since that in some sense is the authority for node metrics.) However, users can turnnode_exporter
s off in Omnibus; this would currently mean that we would get no metrics whatsoever. We should instead consider allinstance
labels from all metrics we pull when forming the "candidate set" of observed instances/nodes.
This MR aims to address both of these issues.
Example
Pulled from an Omnibus container:
"topology": {
"application_requests_per_hour": 140,
"nodes": [
{
"node_memory_total_bytes": 33269903360,
"node_cpus": 16,
"node_services": [
{
"name": "prometheus",
"process_count": 1,
"process_memory_rss": 139467434
},
{
"name": "web",
"process_count": 16,
"process_memory_rss": 772499797,
"process_memory_uss": 100584448,
"process_memory_pss": 142191872,
"server": "puma"
},
{
"name": "sidekiq",
"process_count": 3,
"process_memory_rss": 755548160,
"process_memory_uss": 736721578,
"process_memory_pss": 739047196
},
{
"name": "node-exporter",
"process_count": 1,
"process_memory_rss": 15381845
},
{
"name": "redis",
"process_count": 1,
"process_memory_rss": 13333845
},
{
"name": "postgres",
"process_count": 1,
"process_memory_rss": 16414037
},
{
"name": "workhorse",
"process_count": 1,
"process_memory_rss": 33464320
},
{
"name": "gitaly",
"process_count": 1,
"process_memory_rss": 34036394
}
]
}
],
"duration_s": 0.03569750499809743,
"failures": [
]
}
The fix here is visible in that I am binding prometheus
to 0.0.0.0.
instead of localhost
. It is now included in the topology payload, when previously, it was not, as can be seen e.g. here: !35618 (merged)
Does this MR meet the acceptance criteria?
Conformity
- [-] Changelog entry
- [-] Documentation (if required)
-
Code review guidelines -
Merge request performance guidelines -
Style guides - [-] Database guides
- [-] Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. - [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
-
Test in Omnibus
Closes #222530 (closed)
Edited by Matthias Käppler