Action Cable: instrument and log websocket events
## What does this MR do and why?
GitLab is looking to double down on moving to a UI that updates in real time.
The current stack is implemented on top of two orthogonal technologies:
- Action Cable, as an implementation of the WebSocket protocol.
- GraphQL (specifically subscriptions), as a means to subscribe to and query for changes in data.
While GraphQL is well instrumented when processed via HTTP and `GraphqlController`, it is not when websockets are used as a transport. Specifically, we have significant observability gaps when it comes to Action Cable traffic:
- We only log messages initiated by the client, such as subscribing to an Action Cable channel. However, upstream traffic that is sent back to a client in response to a subscription firing is not instrumented at all at the websocket/AC level. This is especially problematic with GraphQL subscriptions, because the pubsub handler will run a potentially expensive GraphQL query, for which we now miss standard telemetry such as CPU seconds spent, database query count etc.
- It is currently impossible to correlate log events into a coherent stream of events with "caused by" relationships. For example, if I subscribe to an Action Cable channel, and another user sends a request that results in server-side event that updates the channel data I'm subscribed to, we cannot currently associate these two events.
- While we do log GraphQL query traces to a separate log stream (`graphql_json`), these are not written by lograge, so they equally lack standard telemetry and are difficult to correlate with whatever caused these queries to run.
The reason this is difficult to instrument is because there is no such thing as a "request" when using websockets. They are endless streams of messages, which can span multiple server and client nodes. However, our current instrumentation solutions assume "request shaped" data with a single request being fully instrumented in the scope of a single web transaction.
This MR is intended as a step in the right direction to fill some of these gaps. Particularly, it:
- Fully instruments Action Cable `broadcast`s and `stream_from` handlers via lograge and `InstrumentationHelper`.
- Allows us to associate individual log events via correlation IDs, all the way from the original client subscription to pushes that may happen an hour later.
The following is an example of a log event stream generated from my developer laptop while using the single issue page, which uses a websocket connection to subscribe to "issue epic changed" events:
```shell
tail -f log/development_json.log | egrep --line-buffered -v 'realtime_changes|GET' | jq '{ controller, action, "meta.caller_id", method, path, correlation_id, subscription_correlation_id, "action_cable.broadcasting", cpu_s, db_count }'
```
```json
{
  "controller": "GraphqlChannel",
  "action": "subscribe",
  "meta.caller_id": null,
  "method": null,
  "path": null,
  "correlation_id": "01GX5SPHWJXXKMXYN09YBJPBNT",
  "subscription_correlation_id": null,
  "action_cable.broadcasting": null,
  "cpu_s": null,
  "db_count": 1
}
{
  "controller": "ApplicationCable::Channel",
  "action": "broadcast",
  "meta.caller_id": "graphql:projectIssueEpicMutation",
  "method": null,
  "path": null,
  "correlation_id": "01GX5SPXZAG3SAHJ3MD98GZHEV",
  "subscription_correlation_id": null,
  "action_cable.broadcasting": "graphql-event::issuableEpicUpdated:issuableId:Z2lkOi8vZ2l0bGFiL0lzc3VlLzM",
  "cpu_s": 0.259594,
  "db_count": 38
}
{
  "controller": "ApplicationCable::Channel",
  "action": "broadcast",
  "meta.caller_id": "GraphqlChannel#stream_handler",
  "method": null,
  "path": null,
  "correlation_id": "a0d0a84a621c1afa38dfdae680235de0",
  "subscription_correlation_id": null,
  "action_cable.broadcasting": "graphql-subscription:dfeae509-6c3b-4fad-be55-5feb1d39f845",
  "cpu_s": 0.039308,
  "db_count": 10
}
{
  "controller": "GraphqlChannel",
  "action": "stream_handler",
  "meta.caller_id": "GraphqlChannel#stream_handler",
  "method": null,
  "path": null,
  "correlation_id": "a0d0a84a621c1afa38dfdae680235de0",
  "subscription_correlation_id": "01GX5SPHWJXXKMXYN09YBJPBNT",
  "action_cable.broadcasting": "graphql-event::issuableEpicUpdated:issuableId:Z2lkOi8vZ2l0bGFiL0lzc3VlLzM",
  "cpu_s": 0.03995,
  "db_count": 10
}
{
  "controller": "GraphqlController",
  "action": "execute",
  "meta.caller_id": "GraphqlController#execute",
  "method": "POST",
  "path": "/api/graphql",
  "correlation_id": "01GX5SPXZAG3SAHJ3MD98GZHEV",
  "subscription_correlation_id": null,
  "action_cable.broadcasting": null,
  "cpu_s": 0.356221,
  "db_count": 48
}
```
Walking through this step by step:
1. The GitLab frontend subscribes to `GraphqlChannel` with correlation ID `01GX5SPHWJXXKMXYN09YBJPBNT`. graphql-ruby will use this subscription to push query results back to the widget that subscribed.
1. I change the issue's epic in the sidebar. This sends a GraphQL mutation to the server; it is logged last because websocket hooks fire first as a result of changes to the issue data. We can see this POST request had correlation ID `01GX5SPXZAG3SAHJ3MD98GZHEV`.
1. Action Cable `broadcast`s an event to the `graphql-event::issuableEpicUpdated:issuableId:Z2lkOi8vZ2l0bGFiL0lzc3VlLzM` topic, which is the GraphQL event I am subscribed to. It correlates with the mutation from the previous step since it also has correlation ID `01GX5SPXZAG3SAHJ3MD98GZHEV`, which tells us it was caused by step 2. We can also see that its `caller_id` is `graphql:projectIssueEpicMutation`, which immediately tells us which mutation it fired in response to. In production, this event would fire on whichever Puma node served the POST request.
1. Action Cable `broadcast`s another event, this time to the topic `graphql-subscription:dfeae509-6c3b-4fad-be55-5feb1d39f845` (this channel is used to push data back upstream) with caller ID `GraphqlChannel#stream_handler`. In production, this event would fire on the Puma node I am subscribed to from step 1. We can see that this broadcast was initiated from the `GraphqlChannel` stream handler, which is the Redis PubSub callback that runs when an event is received. Note that the correlation ID is different (`a0d0a84a621c1afa38dfdae680235de0`). This is unfortunate, but not a deal breaker; I expand on this below.
1. Action Cable's `stream_handler` logs a separate event: it is the action that actually executes the business logic. Think of it as the domain event handler. In the case of GraphQL, this is where the query executes. It shares the same correlation ID as the `broadcast`, and also with its respective tracer log event. Note that it has an additional field, `subscription_correlation_id`, which holds the correlation ID of the initial request that created the Action Cable subscription.
1. Finally, we see the log event for step 2. It is likely logged last due to timing on my local machine; the order may differ in production and doesn't really matter anyway.
As mentioned, we can also find the GraphQL query tracer log event for the query that executed in step 5 via its correlation ID:
```json
{
  "severity": "INFO",
  "time": "2023-04-04T09:31:33.197Z",
  "correlation_id": "a0d0a84a621c1afa38dfdae680235de0",
  "trace_type": "execute_query",
  "query_fingerprint": "issuableEpicUpdated/AvFPQq2dPB0X1Okf3qWtZ7WiDEjNQlWKrFESfdUUVe0=/1/3-lRmcoZEL9V0NyWaAWhtPXx0IaT4Xx-Cv_p33H7sR0=",
  "duration_s": 0.026578663999316632,
  "operation_name": "issuableEpicUpdated",
  "operation_fingerprint": "issuableEpicUpdated/AvFPQq2dPB0X1Okf3qWtZ7WiDEjNQlWKrFESfdUUVe0=",
  "is_mutation": false,
  "variables": "{\"issuableId\"=>\"gid://gitlab/Issue/3\"}",
  "query_string": "...",
  "meta.caller_id": "graphql:issuableEpicUpdated",
  "query_analysis.duration_s": 0.0007445669998560334,
  "query_analysis.depth": 3,
  "query_analysis.complexity": 9,
  "query_analysis.used_fields": [
    "Issue.id",
    "Issue.hasEpic",
    "Epic.id",
    "Epic.title",
    "Epic.webUrl",
    "Epic.__typename",
    "Issue.epic",
    "Issue.__typename",
    "Issuable.__typename",
    "Subscription.issuableEpicUpdated"
  ],
  "query_analysis.used_deprecated_fields": []
}
```
## Limitations
One thing I would have loved to have is a single correlation ID for all of these events. This turned out to be difficult to solve. The problem is that it is not us who initiate most of these events, but graphql-ruby. Unfortunately, it provides no hooks to transport additional payload with a pubsub event, such as a `correlation_id` (we would have to transport it across the network and Redis from the node where the event fired to any nodes that have subscribers for this event).
This is probably solvable, but I will leave it as a future improvement, since we can already correlate all related events, if less conveniently.
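If we did pursue this later, the rough idea would be to envelope the serialized pubsub payload with correlation metadata on the publishing node and unwrap it on the subscribing node before the handler runs. A hypothetical sketch (the envelope format and helper names are invented for illustration; this is not graphql-ruby API):

```ruby
require "json"

# Publishing side: wrap the original payload with the current
# correlation ID before it goes through Redis to subscriber nodes.
def envelope(payload, correlation_id)
  JSON.generate("correlation_id" => correlation_id, "payload" => payload)
end

# Subscribing side: restore the correlation ID before running the
# (potentially expensive) GraphQL query for the subscriber, so both
# sides log under the same ID.
def unwrap(message)
  data = JSON.parse(message)
  [data["payload"], data["correlation_id"]]
end

message = envelope({ "issuableId" => "gid://gitlab/Issue/3" }, "01GX5SPXZA")
payload, correlation_id = unwrap(message)

puts correlation_id # => "01GX5SPXZA"
```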
Also, some events are still missing data that should ideally be there, such as `caller_id` fields. This is unrelated to these changes and can be addressed in follow-ups.
## Implementation
This is where it gets (even) funkier. Unfortunately, neither Action Cable nor lograge provides the hooks necessary to perform the kind of instrumentation we desire. Therefore, I had to:
- Patch `ActionCable::Channel` to instrument the Redis pubsub handler with an ActiveSupport notification. If we find this useful, we could consider proposing this change upstream as well.
- Patch lograge's `LogSubscribers::ActionCable` class to also process the new notification we emit, and add the additional fields and data we need for full "request" instrumentation.
I left extensive code comments as well, since it may not be immediately clear why these things were put in place.
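To give a flavor of the first patch (simplified and hypothetical, not the actual diff): the pubsub handler passed to `stream_from` gets wrapped so that every invocation fires a notification carrying the channel and broadcasting name, which a log subscriber can then turn into a log event. Sketched in plain Ruby, with a local stand-in for `ActiveSupport::Notifications.instrument`:

```ruby
# Stand-in for ActiveSupport::Notifications.instrument; the real patch
# would use ActiveSupport so that subscribers (e.g. the lograge log
# subscriber) receive the event name and payload.
NOTIFICATIONS = []

def instrument(name, payload)
  result = yield
  NOTIFICATIONS << [name, payload]
  result
end

# Wrap a pubsub handler so each received message is instrumented.
# In the actual patch this wrapping would happen inside a patched
# ActionCable::Channel#stream_from.
def instrumented_handler(channel_class, broadcasting, handler)
  lambda do |message|
    instrument("stream_handler.action_cable",
               channel_class: channel_class,
               broadcasting: broadcasting) do
      handler.call(message) # runs the business logic, e.g. a GraphQL query
    end
  end
end

received = []
handler = instrumented_handler("GraphqlChannel",
                               "graphql-event::issuableEpicUpdated",
                               ->(msg) { received << msg })
handler.call("hello")

puts NOTIFICATIONS.first.first # => "stream_handler.action_cable"
```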
## Log volume
This is my biggest concern. With these changes in place, we will emit the following additional log events:
- `broadcast`: `2 * N * M`, where `N` is the number of subscribers to a topic (e.g. "issue with id 42 epic updated") and `M` is the number of times this topic is updated (issue 42 had its epic changed).
- `stream_handler`: `1 * N * M`, with the same comments as above (it causes the second broadcast back upstream).
The remaining events, `subscribe` and `GraphqlController#execute`, are already logged.
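As a toy calculation of what these formulas mean (all numbers invented purely for illustration):

```ruby
# Hypothetical numbers, purely illustrative.
n = 50 # subscribers to the "issue 42 epic updated" topic
m = 10 # times issue 42's epic changed during the window

broadcast_events      = 2 * n * m # fan-out: event topic + subscription topic
stream_handler_events = 1 * n * m # one handler run per subscriber per update

puts broadcast_events + stream_handler_events # => 1500
```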
The problem is that, AFAIK, we can't really know how many events this will generate, because this is the catch-22 that makes us need this kind of instrumentation to begin with. It is highly dependent on how many subscriptions exist at any point in time and how often data relevant to these subscriptions changes.
I suppose an (albeit unrealistic and overstated) ceiling would be to assume that every subscriber receives every update, but that doesn't really make sense. We do know that we currently serve around 3M subscription requests per hour: https://log.gprd.gitlab.net/goto/192b4a30-d2e3-11ed-8afc-c9851e4645c0
## Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
## How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.
## MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- [ ] I have evaluated the MR acceptance checklist for this MR.
Related to #402717