Handle client disconnects better in workhorse
What does this MR do and why?
When a client disconnects before it has received a response, Workhorse currently detects this as a 502 (logged and put into metrics). However this ends up in metrics which for gitlab.com and GitLab Dedicated is counted as an error against the SLA, when it is entirely a client-side problem. Nginx logs this as a 499 (a non-standard code, but a useful one).
This patch extends that behavior to Workhorse, such that when a client disconnects early, it is counted as a 499 and not a server error, because it is strictly the client's problem (and probably just a browser window being closed or similar).
The original problem was seen in GitLab Dedicated (https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/1503 - internal access only), and the patch was supplied by @jacobvosmaer-gitlab in that issue, I've just put it into an MR and done some testing.
I tried to write a golang test for this, but got tangled up in goroutines and contexts and couldn't quite make it work, but this seems like something that would be good to have if a smarter Go programmer could help out.
How to set up and validate locally
Run a simple webserver that accepts any requests, but contains a delay. Example:
from http.server import HTTPServer, BaseHTTPRequestHandler
import time
class MyHandler(BaseHTTPRequestHandler):
def do_GET(self):
time.sleep(1)
# send 200 response
self.send_response(200)
# send response headers
self.end_headers()
# send the body of the response
self.wfile.write(bytes("It Works!", "utf-8"))
httpd = HTTPServer(('localhost', 8080), MyHandler)
httpd.serve_forever()
Run workhorse, and then use curl to make a request but with a smaller timeout than the sleep
curl -m 0.5 http://localhost:8081/api/v4/foo
Observe that without the patch workhorse logs a 502. Optionally also enable the metrics server on workhorse and check metrics like gitlab_workhorse_http_requests_total to confirm how the status code is being recorded there.
Repeat with the patched workhorse and observe that it now logs a 499 and that also shows up in the metrics under the same status code.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.