Add missing topology metrics to usage ping
What does this MR do?
This is a follow-up to !32315 (merged), which was the initial iteration that added Prometheus support to usage pings, plus a single node metric.
This MR adds all remaining topology metrics that we would like to track via Usage Ping for the MVC, specifically:
- the number of CPU cores per node
- which Ruby services are running on each node, plus:
  - service process count
  - process memory RSS (resident set size)
  - process memory USS (unique set size)
  - process memory PSS (proportional set size)
This will give us a good initial idea of how customers deploy GitLab and how much memory is consumed by the primary services (the Rails services; we are looking to extend this to other components later on).
NOTE that, as with the original MR, all of this only applies to single-node installations for now, since we cannot yet locate an external Prometheus node. This will change at some point in the future, so it can never hurt to look at this through the "future looking glass".
I also decided to extract all topology-related usage data collection into a Concern, since it was getting far too complex to continue living in UsageData itself. That makes it easier to test, too.
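As a rough illustration of the idea (not the code in this MR), the extraction could look something like the sketch below. It uses a plain Ruby module instead of `ActiveSupport::Concern` so it runs standalone, and `TopologySketch`, `FakeMetricsClient`, `UsageDataSketch`, and `node_samples` are hypothetical names chosen for the example.

```ruby
# Hypothetical sketch only: module, class, and method names are illustrative
# and do not mirror the real implementation.
module TopologySketch
  # Collects topology data and records how long the collection took.
  def topology_usage_data
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    nodes = topology_nodes
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

    { topology: { nodes: nodes, duration_s: duration } }
  end

  private

  # Builds one entry per node from whatever the metrics client returns.
  def topology_nodes
    metrics_client.node_samples.map do |sample|
      {
        node_memory_total_bytes: sample.fetch(:memory_total_bytes),
        node_cpus: sample.fetch(:cpus)
      }
    end
  end
end

# Stand-in for a Prometheus client so the sketch runs without a server.
class FakeMetricsClient
  def node_samples
    [{ memory_total_bytes: 33_269_903_360, cpus: 16 }]
  end
end

# The host class only needs to expose `metrics_client`.
class UsageDataSketch
  include TopologySketch

  def metrics_client
    FakeMetricsClient.new
  end
end

puts UsageDataSketch.new.topology_usage_data.inspect
```

Because the module only depends on `metrics_client`, a spec can swap in a fake client like the one above, which is the testability benefit mentioned here.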
See also #216660 (closed)
Example
Here's what the payload will look like (pulled from a QA preview container):
{
  "topology": {
    "nodes": [
      {
        "node_memory_total_bytes": 33269903360,
        "node_cpus": 16,
        "node_services": [
          {
            "name": "gitlab_rails",
            "process_count": 16,
            "process_memory_pss": 233349888,
            "process_memory_rss": 788220927,
            "process_memory_uss": 195295487
          },
          {
            "name": "gitlab_sidekiq",
            "process_count": 1,
            "process_memory_pss": 734080000,
            "process_memory_rss": 750051328,
            "process_memory_uss": 731533312
          }
        ]
      }
    ],
    "duration_s": 0.013836685999194742
  }
}
I'm looking for feedback on this data structure as well (~"group::telemetry").
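To make the structure easier to reason about, here is a small consumer-side sketch that walks the example payload. The helper names `all_services` and `memory_ordering_ok?` are made up for this example. The check relies on the general property that USS <= PSS <= RSS for a process, which makes for a cheap sanity test of reported values.

```ruby
require 'json'

# Example payload copied from this description.
EXAMPLE_PAYLOAD = JSON.parse(<<~PAYLOAD)
  {
    "topology": {
      "nodes": [
        {
          "node_memory_total_bytes": 33269903360,
          "node_cpus": 16,
          "node_services": [
            { "name": "gitlab_rails", "process_count": 16,
              "process_memory_pss": 233349888,
              "process_memory_rss": 788220927,
              "process_memory_uss": 195295487 },
            { "name": "gitlab_sidekiq", "process_count": 1,
              "process_memory_pss": 734080000,
              "process_memory_rss": 750051328,
              "process_memory_uss": 731533312 }
          ]
        }
      ],
      "duration_s": 0.013836685999194742
    }
  }
PAYLOAD

# Flattens the services of all nodes into one list (hypothetical helper).
def all_services(payload)
  payload.dig('topology', 'nodes').flat_map { |node| node.fetch('node_services') }
end

# USS <= PSS <= RSS should hold for any process, so it is a cheap sanity check.
def memory_ordering_ok?(service)
  uss, pss, rss = service.values_at('process_memory_uss', 'process_memory_pss', 'process_memory_rss')
  uss <= pss && pss <= rss
end

all_services(EXAMPLE_PAYLOAD).each do |service|
  puts "#{service['name']}: #{service['process_count']} process(es), ordering ok: #{memory_ordering_ok?(service)}"
end
```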
Performance impact
Querying Prometheus as part of usage ping naturally raises questions about the performance impact. I benchmarked the impact of running the 4 queries currently used against our production Prometheus servers. The benchmark can be found here: https://gitlab.com/gitlab-org/gitlab/snippets/1983636
The time taken to send all 4 queries swings quite wildly, between roughly 0.9 and 5 seconds:
Rehearsal ------------------------------------------------
app queries 0.043742 0.000242 0.043984 ( 0.699953)
main queries 0.034275 0.000000 0.034275 ( 0.581425)
all queries 0.081021 0.003966 0.084987 ( 1.252002)
--------------------------------------- total: 0.163246sec
user system total real
app queries 0.024574 0.000000 0.024574 ( 0.586431)
main queries 0.017995 0.003984 0.021979 ( 0.473990)
all queries 0.042974 0.000185 0.043159 ( 1.151759)
Rehearsal ------------------------------------------------
app queries 0.022172 0.000135 0.022307 ( 0.475476)
main queries 0.012620 0.008028 0.020648 ( 0.465121)
all queries 0.040645 0.000042 0.040687 ( 0.954856)
--------------------------------------- total: 0.083642sec
user system total real
app queries 0.045079 0.000069 0.045148 ( 0.491550)
main queries 0.038586 0.000155 0.038741 ( 0.479245)
all queries 0.046054 0.003882 0.049936 ( 5.009503)
Rehearsal ------------------------------------------------
app queries 0.020487 0.000270 0.020757 ( 0.503571)
main queries 0.021728 0.000336 0.022064 ( 0.571416)
all queries 0.041445 0.000320 0.041765 ( 1.104775)
--------------------------------------- total: 0.084586sec
user system total real
app queries 0.019996 0.003766 0.023762 ( 0.466840)
main queries 0.021804 0.000271 0.022075 ( 0.486701)
all queries 0.040141 0.000236 0.040377 ( 4.955309)
Rehearsal ------------------------------------------------
app queries 0.016684 0.003810 0.020494 ( 0.461673)
main queries 0.025718 0.000325 0.026043 ( 4.465426)
all queries 0.039920 0.000223 0.040143 ( 0.995094)
--------------------------------------- total: 0.086680sec
user system total real
app queries 0.023683 0.000000 0.023683 ( 0.530871)
main queries 0.016763 0.003819 0.020582 ( 0.513517)
all queries 0.037971 0.004302 0.042273 ( 1.195756)
Rehearsal ------------------------------------------------
app queries 0.017204 0.004286 0.021490 ( 0.474670)
main queries 0.020740 0.000262 0.021002 ( 0.565457)
all queries 0.040274 0.000015 0.040289 ( 1.125658)
--------------------------------------- total: 0.082781sec
user system total real
app queries 0.023898 0.000214 0.024112 ( 0.445421)
main queries 0.022364 0.000091 0.022455 ( 0.448126)
all queries 0.042450 0.000000 0.042450 ( 0.942314)
Considering that this job only runs once a week, and furthermore that this is querying gitlab.com data, which I presume is much larger than any of our customers', this is probably acceptable.
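For readers unfamiliar with the output format above: it matches what Ruby's `Benchmark.bmbm` prints, with a rehearsal pass followed by the measured pass. The sketch below only reproduces that shape. The lambdas are stand-ins that sleep briefly instead of calling Prometheus, so they are assumptions for illustration and not the queries from the linked snippet.

```ruby
require 'benchmark'

# Stand-ins for the real Prometheus calls; the sleeps simulate network latency.
# These lambdas are assumptions for illustration only.
APP_QUERIES = -> { sleep(0.01) }
MAIN_QUERIES = -> { sleep(0.01) }

# Benchmark.bmbm runs a rehearsal pass first, then the measured pass,
# which is why each run above shows two blocks of timings.
def run_query_benchmark
  Benchmark.bmbm(12) do |x|
    x.report('app queries') { APP_QUERIES.call }
    x.report('main queries') { MAIN_QUERIES.call }
    x.report('all queries') do
      APP_QUERIES.call
      MAIN_QUERIES.call
    end
  end
end

timings = run_query_benchmark
puts "measured #{timings.size} reports"
```

The last column in parentheses is wall-clock time, which is the relevant number here since the queries spend most of their time waiting on the network.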
Does this MR meet the acceptance criteria?
Conformity
- [-] Changelog entry
  - We already added a changelog entry for the MR this one builds on. Since we remained relatively vague about what exactly we're tracking, and since we will add documentation too, I don't think we need another one.
- [-] Documentation (if required)
  - We already have a separate issue open for this: #220143 (closed)
- Code review guidelines
- Merge request performance guidelines
- Style guides
- [-] Database guides
- [-] Separation of EE specific content
Availability and Testing
- Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
- Test in Omnibus build
Notes on testing:
To test this locally, I have published an Omnibus image via CI that can be pulled like so:
docker pull registry.gitlab.com/gitlab-org/build/omnibus-gitlab-mirror/gitlab-ee:02233b62afd6d122236ecb5ff118cc47fe7bc062
When running this container, you can preview the Usage Ping payload as usual from the Admin Area > Usage statistics panel.