Add or refine Prometheus metrics for GraphQL
Description
We have a number of Prometheus metrics that help us observe how GraphQL performs. It is not clear, however, whether the metrics we have are sufficient or whether we should change them.
- The cardinality of the existing metrics can be a significant problem.
- We might want to add more metrics.
Right now the metric `graphql_duration_seconds_count` has a cardinality of 89,624; see `count(graphql_duration_seconds_count{env="gprd"})` in action. This is growing quickly, and we already experience significant problems with `gitlab_sql_duration_seconds_count`, which has a cardinality of around 150k.
Problem
Some investigation needs to be done around the existing metrics to better understand whether we should add new ones or refine the ones we have.
The outcome might be addressing concerns over the high cardinality of the `graphql_duration_seconds_count` metric, which we know will cause problems soon, and making it possible to observe latency and request rate for a given GraphQL field / type / resolver using Prometheus.
We may want to add a counter that aggregates hits for every GraphQL type / field / resolver, and a histogram that only gets recorded when we exceed some threshold, like 100ms. Further investigation might be needed to find the best solution.
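As a rough illustration of the counter-plus-threshold idea, here is a minimal sketch. A plain Hash stands in for the real Prometheus client, and the `FieldMetrics` name and 100ms constant are hypothetical, not existing code:

```ruby
# Illustrative sketch only: a Hash stands in for a real Prometheus client.
# Every field execution increments a counter; the histogram is recorded
# only for slow executions, which keeps its volume down.
SLOW_FIELD_THRESHOLD_SECONDS = 0.1 # the ~100ms threshold mentioned above

class FieldMetrics
  attr_reader :counter, :histogram

  def initialize
    @counter = Hash.new(0)                     # labels => hit count
    @histogram = Hash.new { |h, k| h[k] = [] } # labels => slow durations
  end

  def record(type:, field:, duration:)
    labels = [type, field]
    @counter[labels] += 1
    @histogram[labels] << duration if duration > SLOW_FIELD_THRESHOLD_SECONDS
  end
end
```

This way every type / field pair is counted, but only outliers produce histogram observations.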
Solution proposal
Let's create a new tracer using GraphQL Ruby's tracing hooks. Unlike the current `generic_tracing.rb`, we shouldn't inherit from `GraphQL::Tracing::PlatformTracing`, since this class is part of the library's private API (according to the API docs).
For now, the new GraphQL tracer will just watch for two events (see this tracer for more examples):
- `execute_field`, which contains metadata about the current field being executed
- `execute_query`, which contains metadata about a query being executed
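A minimal sketch of such a tracer, built on graphql-ruby's public `trace(key, data)` hook rather than `GraphQL::Tracing::PlatformTracing`. The class name is hypothetical, and the `record` method is a stub where a real implementation would write to Prometheus:

```ruby
# Sketch of a tracer using graphql-ruby's public tracing hook.
# Events other than the two we care about are passed straight through.
class GraphqlMetricsTracer
  WATCHED_EVENTS = %w[execute_field execute_query].freeze

  def trace(key, data)
    return yield unless WATCHED_EVENTS.include?(key)

    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    record(key, data, duration)
    result
  end

  private

  # Stub: a real implementation would observe a Prometheus histogram
  # with labels derived from `data` (field name, operation name, ...).
  def record(key, data, duration); end
end
```

A tracer like this could then be attached to the schema, e.g. `MySchema.tracer(GraphqlMetricsTracer.new)`.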
The new GraphQL tracer will write the duration of these events to two separate Prometheus metrics:
- `graphql_field_duration_seconds`, which contains keys for `feature_category`, `field_name`, and `operation_name`
- `graphql_query_duration_seconds`, which contains keys for `feature_category` and `operation_name`
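To keep the new metrics' cardinality bounded, each one would carry only the label keys listed above. A sketch of whitelisting labels per metric (the metadata shape and the `labels_for` helper are assumptions for illustration, not graphql-ruby's actual payload):

```ruby
# Sketch: derive the label set for each proposed metric from event
# metadata, deliberately dropping everything outside the whitelist
# so that cardinality stays bounded.
FIELD_DURATION_LABELS = %i[feature_category field_name operation_name].freeze
QUERY_DURATION_LABELS = %i[feature_category operation_name].freeze

def labels_for(metric, metadata)
  allowed =
    case metric
    when :graphql_field_duration_seconds then FIELD_DURATION_LABELS
    when :graphql_query_duration_seconds then QUERY_DURATION_LABELS
    else raise ArgumentError, "unknown metric: #{metric}"
    end

  metadata.slice(*allowed)
end
```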
Why split `graphql_field_duration_seconds` and `graphql_query_duration_seconds`?
Since the fields could be executed in parallel, their durations won't add up to the total query duration. Nevertheless, it's important to see the context of a certain field's execution, so we will include the `operation_name` here.
Why not use or build off of the existing tracer `generic_tracing.rb`, or use the existing `graphql_duration_seconds`?
In `generic_tracing.rb`, we've copied the graphql-ruby implementation for Prometheus tracing. This creates a massive bucket for `:graphql_duration_seconds`, which carries very generic `key:` and `platform_key:` labels. If `key="execute_field"`, you can see a breakdown of every GraphQL key contained in `platform_key`. But `key:` can also be `"execute_query"`, `"analyze_query"`, and so on. This metric is currently being used as a single bucket to trace every event coming from GraphQL, which is likely adding to its very high cardinality.
Does the `execute_query` event cover mutations as well?
I'm not sure, but the docs seem to suggest so. We'll need to verify this.