Add missing topology metrics to usage ping (!33191) · Merge requests · GitLab.org / GitLab

Matthias Käppler requested to merge 216660-add-basic-topology-metrics into master May 27, 2020

What does this MR do?

This is a follow-up to !32315 (merged), which was the initial iteration that added Prometheus support to usage pings, plus a single node metric.

This MR adds all remaining topology metrics that we would like to track via Usage Ping for the MVC, specifically:

the number of CPU cores per node
which Ruby services are running on each node, plus:
- service process count
- process memory RSS (resident set size)
- process memory USS (unique set size)
- process memory PSS (proportional set size)

This will give us a good initial idea of how customers deploy GitLab and how much memory is consumed by the primary services, that is, the Rails services. We are looking to extend this to other components later on.

NOTE that, as with the original MR, all of this will only apply to single-node installations for now, since we do not yet have the capability to locate an external Prometheus node. This will change at some point in the future though, so it can never hurt to look at this through the "future looking glass" 🔭

I also decided to extract all topology-related usage data collection into a Concern, since it was getting far too complex to continue living in UsageData itself. That makes it easier to test, too.
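To illustrate why the extraction helps with testing, here is a minimal sketch in plain Ruby. The names `TopologySketch`, `FakeClient`, `UsageDataSketch`, and the `query` method are all hypothetical and do not mirror the code in this MR; the actual Concern presumably builds on Rails conventions and a real Prometheus client.

```ruby
# Hypothetical module; the Concern in this MR may be named and shaped differently.
module TopologySketch
  # Collects node metrics from the given client and records how long it took.
  def topology_usage_data(client)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    node = {
      node_memory_total_bytes: client.query(:node_memory_total_bytes),
      node_cpus: client.query(:node_cpus)
    }
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

    { topology: { nodes: [node], duration_s: duration } }
  end
end

# Stand-in for a Prometheus client (assumption): returns canned values
# instead of sending queries.
class FakeClient
  def initialize(values)
    @values = values
  end

  def query(metric)
    @values.fetch(metric)
  end
end

# Hypothetical host class that mixes the module in as class-level methods.
class UsageDataSketch
  extend TopologySketch
end

data = UsageDataSketch.topology_usage_data(
  FakeClient.new({ node_memory_total_bytes: 33_269_903_360, node_cpus: 16 })
)
puts data[:topology][:nodes].first[:node_cpus]
```

Because the module only depends on an object that responds to `query`, a test can pass in a fake client and never needs a running Prometheus server.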

Example

Here's what the payload will look like (pulled from QA preview container):

{
  "topology": {
    "nodes": [
      {
        "node_memory_total_bytes": 33269903360,
        "node_cpus": 16,
        "node_services": [
          {
            "name": "gitlab_rails",
            "process_count": 16,
            "process_memory_pss": 233349888,
            "process_memory_rss": 788220927,
            "process_memory_uss": 195295487
          },
          {
            "name": "gitlab_sidekiq",
            "process_count": 1,
            "process_memory_pss": 734080000,
            "process_memory_rss": 750051328,
            "process_memory_uss": 731533312
          }
        ]
      }
    ],
    "duration_s": 0.013836685999194742
  }
}
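For reviewers thinking about how this structure would be consumed, the following sketch reads the payload above and reshapes it into a per-node summary. The helper name `summarize_topology` is hypothetical and not part of this MR; it only assumes the keys shown in the example.

```ruby
require 'json'

# Payload copied from the example above (duration_s omitted for brevity).
PAYLOAD = <<~JSON
  {
    "topology": {
      "nodes": [
        {
          "node_memory_total_bytes": 33269903360,
          "node_cpus": 16,
          "node_services": [
            { "name": "gitlab_rails", "process_count": 16,
              "process_memory_pss": 233349888,
              "process_memory_rss": 788220927,
              "process_memory_uss": 195295487 },
            { "name": "gitlab_sidekiq", "process_count": 1,
              "process_memory_pss": 734080000,
              "process_memory_rss": 750051328,
              "process_memory_uss": 731533312 }
          ]
        }
      ]
    }
  }
JSON

# Hypothetical helper: returns one summary hash per node.
def summarize_topology(payload)
  payload.fetch('topology').fetch('nodes').map do |node|
    services = node.fetch('node_services', [])
    {
      cpus: node['node_cpus'],
      memory_total_bytes: node['node_memory_total_bytes'],
      process_count: services.sum { |s| s['process_count'] },
      # Map each service name to its reported PSS value.
      pss_by_service: services.to_h { |s| [s['name'], s['process_memory_pss']] }
    }
  end
end

summary = summarize_topology(JSON.parse(PAYLOAD))
puts summary.first[:pss_by_service].keys.join(', ')
```

Keeping services as an array of objects with a `name` field, rather than keying by service name, means a consumer can iterate without knowing the set of services in advance.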

I'm looking for feedback on this data structure as well (~"group::telemetry").

Performance impact

Querying Prometheus as part of usage ping raises questions about the performance impact, of course. I benchmarked the 4 queries currently used against our production Prometheus servers. The benchmark can be found here: https://gitlab.com/gitlab-org/gitlab/snippets/1983636
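The linked snippet is not reproduced here. As a rough idea of its shape, the "Rehearsal" sections in the output indicate Ruby's `Benchmark.bmbm`, so a benchmark of this kind might look like the sketch below. The query lambdas are hypothetical stand-ins that only sleep, and the split of the 4 queries into two "app" and two "main" queries is an assumption based on the report labels.

```ruby
require 'benchmark'

# Hypothetical stand-ins for the Prometheus queries; each just sleeps briefly
# so the sketch runs without a Prometheus server.
APP_QUERIES  = [-> { sleep 0.01 }, -> { sleep 0.01 }].freeze
MAIN_QUERIES = [-> { sleep 0.01 }, -> { sleep 0.01 }].freeze

# bmbm runs every report twice: a rehearsal pass, then the measured pass.
results = Benchmark.bmbm do |bm|
  bm.report('app queries')  { APP_QUERIES.each(&:call) }
  bm.report('main queries') { MAIN_QUERIES.each(&:call) }
  bm.report('all queries')  { (APP_QUERIES + MAIN_QUERIES).each(&:call) }
end

# `real` is wall-clock time, the parenthesised column in the output.
slowest = results.max_by(&:real)
puts format('slowest report took %.3fs', slowest.real)
```

For network-bound work like this, the `real` column is the one that matters; the user and system columns only reflect CPU time spent in the Ruby process.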

The wall-clock time for sending all 4 queries swings quite wildly, from roughly 0.9 seconds to 5 seconds:

Rehearsal ------------------------------------------------
app queries    0.043742   0.000242   0.043984 (  0.699953)
main queries   0.034275   0.000000   0.034275 (  0.581425)
all queries    0.081021   0.003966   0.084987 (  1.252002)
--------------------------------------- total: 0.163246sec

                   user     system      total        real
app queries    0.024574   0.000000   0.024574 (  0.586431)
main queries   0.017995   0.003984   0.021979 (  0.473990)
all queries    0.042974   0.000185   0.043159 (  1.151759)
Rehearsal ------------------------------------------------
app queries    0.022172   0.000135   0.022307 (  0.475476)
main queries   0.012620   0.008028   0.020648 (  0.465121)
all queries    0.040645   0.000042   0.040687 (  0.954856)
--------------------------------------- total: 0.083642sec

                   user     system      total        real
app queries    0.045079   0.000069   0.045148 (  0.491550)
main queries   0.038586   0.000155   0.038741 (  0.479245)
all queries    0.046054   0.003882   0.049936 (  5.009503)
Rehearsal ------------------------------------------------
app queries    0.020487   0.000270   0.020757 (  0.503571)
main queries   0.021728   0.000336   0.022064 (  0.571416)
all queries    0.041445   0.000320   0.041765 (  1.104775)
--------------------------------------- total: 0.084586sec

                   user     system      total        real
app queries    0.019996   0.003766   0.023762 (  0.466840)
main queries   0.021804   0.000271   0.022075 (  0.486701)
all queries    0.040141   0.000236   0.040377 (  4.955309)
Rehearsal ------------------------------------------------
app queries    0.016684   0.003810   0.020494 (  0.461673)
main queries   0.025718   0.000325   0.026043 (  4.465426)
all queries    0.039920   0.000223   0.040143 (  0.995094)
--------------------------------------- total: 0.086680sec

                   user     system      total        real
app queries    0.023683   0.000000   0.023683 (  0.530871)
main queries   0.016763   0.003819   0.020582 (  0.513517)
all queries    0.037971   0.004302   0.042273 (  1.195756)
Rehearsal ------------------------------------------------
app queries    0.017204   0.004286   0.021490 (  0.474670)
main queries   0.020740   0.000262   0.021002 (  0.565457)
all queries    0.040274   0.000015   0.040289 (  1.125658)
--------------------------------------- total: 0.082781sec

                   user     system      total        real
app queries    0.023898   0.000214   0.024112 (  0.445421)
main queries   0.022364   0.000091   0.022455 (  0.448126)
all queries    0.042450   0.000000   0.042450 (  0.942314)
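To make the spread easier to see, the sketch below summarizes the "all queries" wall-clock times from the output above, counting rehearsal and measured passes alike. The 4-second outlier threshold is an arbitrary choice for illustration.

```ruby
# "all queries" real (wall-clock) times in seconds, copied from the output
# above in order of appearance, rehearsal and measured passes included.
ALL_QUERIES_REAL = [
  1.252002, 1.151759, 0.954856, 5.009503, 1.104775,
  4.955309, 0.995094, 1.195756, 1.125658, 0.942314
].freeze

sorted = ALL_QUERIES_REAL.sort
# Ten samples, so the median is the mean of the two middle values.
median = (sorted[4] + sorted[5]) / 2.0
# Arbitrary threshold (assumption) to separate slow runs from typical ones.
outliers = ALL_QUERIES_REAL.count { |t| t > 4.0 }

puts format('min=%.3fs median=%.3fs max=%.3fs, runs over 4s: %d',
            sorted.first, median, sorted.last, outliers)
```

Most passes cluster around one second, with two of the ten taking about five seconds.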

Considering that this job runs only once a week, and that the benchmark queries gitlab.com data, which I presume is far larger than any of our customers', this overhead is probably acceptable.

Does this MR meet the acceptance criteria?

Conformity

[-] Changelog entry
- We already added a changelog entry for the MR this one builds on. Since we remained relatively vague about what exactly we're tracking, and since we will add documentation too, I don't think we need another one.
[-] Documentation (if required)
- We already have a separate issue open for this: #220143 (closed)
Code review guidelines
Merge request performance guidelines
Style guides
[-] Database guides
[-] Separation of EE specific content

Availability and Testing

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
[-] Tested in all supported browsers
[-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Test in Omnibus build

Notes on testing:

To test this locally, I have published an Omnibus image via CI that can be pulled like so:

docker pull registry.gitlab.com/gitlab-org/build/omnibus-gitlab-mirror/gitlab-ee:02233b62afd6d122236ecb5ff118cc47fe7bc062

When running this container, you can preview the Usage Ping payload as usual from the Admin Area > Usage statistics panel.

Edited May 31, 2022 by 🤖 GitLab Bot 🤖
