Add or refine Prometheus metrics for GraphQL
Description
We have a number of Prometheus metrics that help us observe how GraphQL performs. It is not clear, however, whether the metrics we have are sufficient or whether we should change them.
- The cardinality of the existing metrics can be a significant problem.
- We might want to add more metrics.
Right now the metric `graphql_duration_seconds_count` has a cardinality of 89,624; see `count(graphql_duration_seconds_count{env="gprd"})` in action. This is growing quickly, and we already experience significant problems with `gitlab_sql_duration_seconds_count`, which has a cardinality of around 150k.
Problem
Some investigation needs to be done around the existing metrics to better understand whether we should add new ones or refine the ones we have.
The outcome might be addressing concerns over the high cardinality of the `graphql_duration_seconds_count` metric, which we know will cause problems soon, and making it possible to observe latency and request rate for a given GraphQL field / type / resolver using Prometheus.
We may want to add a counter that aggregates hits for every GraphQL type / field / resolver, and a histogram that only gets recorded when we exceed some threshold, like 100ms. Further investigation might be needed to find the best solution.
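As a rough illustration of the counter-plus-threshold idea, here is a minimal sketch. A plain Hash stands in for the real Prometheus client, and the `FieldMetrics` name and 100ms constant are hypothetical, not existing code:

```ruby
# Illustrative sketch only: a Hash stands in for a real Prometheus client.
# Every field execution increments a counter; the histogram is recorded
# only for slow executions, which keeps its volume down.
SLOW_FIELD_THRESHOLD_SECONDS = 0.1 # the ~100ms threshold mentioned above

class FieldMetrics
  attr_reader :counter, :histogram

  def initialize
    @counter = Hash.new(0)                     # labels => hit count
    @histogram = Hash.new { |h, k| h[k] = [] } # labels => slow durations
  end

  def record(type:, field:, duration:)
    labels = [type, field]
    @counter[labels] += 1
    @histogram[labels] << duration if duration > SLOW_FIELD_THRESHOLD_SECONDS
  end
end
```

This way every type / field pair is counted, but only outliers produce histogram observations.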
Solution proposal
Let's create a new tracer using GraphQL Ruby's tracing hooks. Unlike the current `generic_tracing.rb`, we shouldn't inherit from `GraphQL::Tracing::PlatformTracing`, since this class is part of the library's private API (according to the API docs).
For now, the new GraphQL tracer will just watch for two events (see this tracer for more examples):
- `execute_field`, which contains metadata about the current field being executed
- `execute_query`, which contains metadata about a query being executed
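A minimal sketch of such a tracer, built on graphql-ruby's public `trace(key, data)` hook rather than `GraphQL::Tracing::PlatformTracing`. The class name is hypothetical, and the `record` method is a stub where a real implementation would write to Prometheus:

```ruby
# Sketch of a tracer using graphql-ruby's public tracing hook.
# Events other than the two we care about are passed straight through.
class GraphqlMetricsTracer
  WATCHED_EVENTS = %w[execute_field execute_query].freeze

  def trace(key, data)
    return yield unless WATCHED_EVENTS.include?(key)

    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    record(key, data, duration)
    result
  end

  private

  # Stub: a real implementation would observe a Prometheus histogram
  # with labels derived from `data` (field name, operation name, ...).
  def record(key, data, duration); end
end
```

A tracer like this could then be attached to the schema, e.g. `MySchema.tracer(GraphqlMetricsTracer.new)`.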
The new GraphQL tracer will write the duration of these events to two separate Prometheus metrics:
- `graphql_field_duration_seconds`, which contains keys for `feature_category`, `field_name`, and `operation_name`
- `graphql_query_duration_seconds`, which contains keys for `feature_category` and `operation_name`
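To keep the new metrics' cardinality bounded, each one would carry only the label keys listed above. A sketch of whitelisting labels per metric (the metadata shape and the `labels_for` helper are assumptions for illustration, not graphql-ruby's actual payload):

```ruby
# Sketch: derive the label set for each proposed metric from event
# metadata, deliberately dropping everything outside the whitelist
# so that cardinality stays bounded.
FIELD_DURATION_LABELS = %i[feature_category field_name operation_name].freeze
QUERY_DURATION_LABELS = %i[feature_category operation_name].freeze

def labels_for(metric, metadata)
  allowed =
    case metric
    when :graphql_field_duration_seconds then FIELD_DURATION_LABELS
    when :graphql_query_duration_seconds then QUERY_DURATION_LABELS
    else raise ArgumentError, "unknown metric: #{metric}"
    end

  metadata.slice(*allowed)
end
```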
Why split `graphql_field_duration_seconds` and `graphql_query_duration_seconds`?
Since the fields could be executed in parallel, their durations won't add up to the total query duration. Nevertheless, it's important to see the context of a certain field's execution, so we will include the `operation_name` here.
Why not use or build off of the existing tracer `generic_tracing.rb`, or use the existing `graphql_duration_seconds`?
In `generic_tracing.rb`, we've copied the graphql-ruby implementation for Prometheus tracing. This creates a massive bucket for `:graphql_duration_seconds`, which carries very generic `key:` and `platform_key:` labels. If `key="execute_field"`, you can see a breakdown of every GraphQL key contained in `platform_key`. But `key:` can also be `"execute_query"`, `"analyze_query"`, and so on. This metric is currently being used as a single bucket to trace every event coming from GraphQL, which is likely adding to its very high cardinality.
Does the `execute_query` event cover mutations as well?
I'm not sure, but the docs seem to suggest so. We'll need to verify this.