Handle client disconnects better in workhorse (!109945) · Merge requests · GitLab.org / GitLab

Craig Miskell requested to merge workhorse-499-on-client-disconnect into master Jan 25, 2023

What does this MR do and why?

When a client disconnects before it has received a response, Workhorse currently detects this as a 502 (logged and put into metrics). However this ends up in metrics which for gitlab.com and GitLab Dedicated is counted as an error against the SLA, when it is entirely a client-side problem. Nginx logs this as a 499 (a non-standard code, but a useful one).

This patch extends that behavior to Workhorse, such that when a client disconnects early, it is counted as a 499 and not a server error, because it is strictly the client's problem (and probably just a browser window being closed or similar).

The original problem was seen in GitLab Dedicated (https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/1503 - internal access only), and the patch was supplied by @jacobvosmaer-gitlab in that issue, I've just put it into an MR and done some testing.

I tried to write a golang test for this, but got tangled up in goroutines and contexts and couldn't quite make it work, but this seems like something that would be good to have if a smarter Go programmer could help out.

How to set up and validate locally

Run a simple webserver that accepts any requests, but contains a delay. Example:

from http.server import HTTPServer, BaseHTTPRequestHandler
import time

class MyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
      time.sleep(1)
      # send 200 response
      self.send_response(200)
      # send response headers
      self.end_headers()
      # send the body of the response
      self.wfile.write(bytes("It Works!", "utf-8"))

httpd = HTTPServer(('localhost', 8080), MyHandler)
httpd.serve_forever()

Run workhorse, and then use curl to make a request but with a smaller timeout than the sleep

curl -m 0.5 http://localhost:8081/api/v4/foo

Observe that without the patch workhorse logs a 502. Optionally also enable the metrics server on workhorse and check metrics like gitlab_workhorse_http_requests_total to confirm how the status code is being recorded there.

Repeat with the patched workhorse and observe that it now logs a 499 and that also shows up in the metrics under the same status code.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Handle client disconnects better in workhorse

What does this MR do and why?

How to set up and validate locally

MR acceptance checklist

Merge request reports