GraphQL endpoint for merge request getState requested every few seconds
ZD: https://gitlab.zendesk.com/agent/tickets/462493
The customer upgraded from PostgreSQL 12 to 13 in GitLab 15.11.13 to prepare for GitLab 16.2+. However, as soon as they upgraded the database, CPU load on puma
increased to over 80%+, when previously it was under 30%.
In the SOS logs from 20231021151659
, fast-stats
showed that GraphqlController#execute
and Projects::MergeRequests::ContentController#widget
dominated the request rate, with the former being the most problematic:
CONTROLLER COUNT RPS P99_ms P95_ms MEDIAN_ms MAX_ms MIN_ms SCORE %FAIL
GraphqlController#execute 3047 17.82 1991.1 1179.1 409.9 8144.7 22.9 6066930.5 0.00
Projects::MergeRequests::ContentController#widget 2932 17.15 456.0 302.6 82.5 1438.1 31.2 1337124.2 0.00
Repositories::GitHttpController#info_refs 215 1.26 437.0 325.2 50.3 878.5 10.5 93957.3 0.00
Projects::MergeRequests::ContentController#cached_widget 38 0.22 1068.1 978.6 408.4 1071.7 153.0 40586.2 0.00
Repositories::GitHttpController#git_upload_pack 50 0.29 508.4 430.5 159.8 544.7 53.1 25419.5 0.00
Projects::MergeRequestsController#ci_environments_status 19 0.11 563.2 406.5 158.3 602.4 51.1 10700.9 0.00
ApplicationCable::Connection#connect 27 0.16 243.2 207.4 9.4 255.0 3.5 6566.4 0.00
GraphqlChannel#subscribe 22 0.13 197.8 165.6 140.6 206.3 95.7 4352.1 0.00
ApplicationCable::Connection#disconnect 26 0.15 80.6 28.1 0.1 97.2 0.1 2094.8 0.00
MetricsController#index 6 0.04 327.3 312.7 234.5 331.0 220.6 1964.0 0.00
HealthController#readiness 110 0.64 15.2 8.0 0.9 18.9 0.7 1671.2 0.00
GraphqlChannel#unsubscribe 18 0.11 53.5 27.9 4.2 60.0 0.6 963.6 0.00
UploadsController#show 2 0.01 47.6 47.3 44.9 47.6 42.1 95.1 0.00
PwaController#manifest 1 0.01 23.9 23.9 23.9 23.9 23.9 23.9 0.00
There were about 86 users total involved, and it seemed based on the User-Agent (ua
) that most of the requests were coming from opened Chrome browsers:
% cat production_json.log | grep Graphql | jq .ua | sort | uniq -c | sort -r
2189 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
480 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
106 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
86 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15"
84 "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
76 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
41 "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/118.0"
19 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/118.0"
6 "python-requests/2.28.1"
This makes me suspect that the user's browsers were left open, and for some reason the client was requesting these merge request endpoints quite frequently.
If I take the user with the most requests as an example, we see 4 different merge requests were related:
% cat production_json.log | grep Graphql | grep REDACTED_user | jq . | grep variables | grep iid | sort | uniq -c | sort -r
59 "variables": "{\"projectPath\"=>\"REDACTED/REDACTED\", \"iid\"=>\"16983\"}",
28 "variables": "{\"projectPath\"=>\"REDACTED/REDACTED\", \"iid\"=>\"17039\"}",
24 "variables": "{\"projectPath\"=>\"REDACTED/REDACTED\", \"iid\"=>\"13477\"}",
19 "variables": "{\"projectPath\"=>\"REDACTED/REDACTED\", \"iid\"=>\"13314\"}",
Look at the timestamps:
% cat production_json.log| grep Graphql | grep REDACTED_user | grep 16983 | jq .time
"2023-10-21T22:17:37.741Z"
"2023-10-21T22:17:40.918Z"
"2023-10-21T22:17:46.716Z"
"2023-10-21T22:17:46.799Z"
"2023-10-21T22:18:01.905Z"
"2023-10-21T22:18:03.820Z"
"2023-10-21T22:18:09.792Z"
"2023-10-21T22:18:15.752Z"
"2023-10-21T22:18:16.729Z"
"2023-10-21T22:18:21.721Z"
"2023-10-21T22:18:22.760Z"
"2023-10-21T22:18:24.807Z"
"2023-10-21T22:18:31.840Z"
"2023-10-21T22:18:36.729Z"
"2023-10-21T22:18:39.735Z"
"2023-10-21T22:18:41.812Z"
"2023-10-21T22:18:43.076Z"
"2023-10-21T22:18:43.800Z"
"2023-10-21T22:18:44.767Z"
"2023-10-21T22:18:50.721Z"
"2023-10-21T22:18:51.813Z"
"2023-10-21T22:18:58.067Z"
"2023-10-21T22:18:59.234Z"
"2023-10-21T22:19:11.044Z"
"2023-10-21T22:19:11.112Z"
"2023-10-21T22:19:11.827Z"
"2023-10-21T22:19:14.365Z"
"2023-10-21T22:19:15.362Z"
"2023-10-21T22:19:16.419Z"
"2023-10-21T22:19:17.658Z"
"2023-10-21T22:19:19.438Z"
"2023-10-21T22:19:21.650Z"
"2023-10-21T22:19:23.900Z"
"2023-10-21T22:19:25.639Z"
"2023-10-21T22:19:31.366Z"
"2023-10-21T22:19:32.360Z"
"2023-10-21T22:19:35.432Z"
"2023-10-21T22:19:37.566Z"
"2023-10-21T22:19:39.310Z"
"2023-10-21T22:19:43.380Z"
"2023-10-21T22:19:43.895Z"
"2023-10-21T22:19:47.518Z"
"2023-10-21T22:19:49.823Z"
"2023-10-21T22:19:51.234Z"
"2023-10-21T22:19:52.248Z"
"2023-10-21T22:19:52.936Z"
"2023-10-21T22:19:52.978Z"
"2023-10-21T22:19:59.957Z"
"2023-10-21T22:20:01.875Z"
"2023-10-21T22:20:02.765Z"
"2023-10-21T22:20:06.334Z"
"2023-10-21T22:20:06.873Z"
"2023-10-21T22:20:08.080Z"
"2023-10-21T22:20:09.777Z"
"2023-10-21T22:20:11.553Z"
"2023-10-21T22:20:19.083Z"
"2023-10-21T22:20:19.209Z"
"2023-10-21T22:20:20.916Z"
"2023-10-21T22:20:21.102Z"
I thought the frontend code was supposed to request this once, and then poll with increasingly long intervals (https://gitlab.com/gitlab-org/gitlab/-/blob/e3bb1d633d6b59c808234c8c07defecea9998142/app/assets/javascripts/vue_merge_request_widget/mr_widget_options.vue#L188-199, !108860 (merged)). In my testing, I saw that requests were spaced out every few minutes rather than every second.
I thought that the real-time improvements in #372514 (closed) would have reduced the need for polling, but I suspect something went wrong here.
@thomasrandolph Do you have any idea why the frontend might not be increasing its poll interval here?