Improve cache hit ratios with nginx caching and proxy_cache_revalidate
On 2018-06-14, GitLab.com was affected by an outage caused by an application that was using GitLab.com as a content distribution mechanism. The application was running on hundreds or thousands of hosts, all concurrently polling a repository for content changes and using the `Projects::RawController#show` Rails endpoint to fetch the content.
During the weekly infrastructure call, @jarv raised the point of better caching, since each client was loading the same blob data from Git, but no caching was being done.
I suggested that this caching does not need to happen at the application level; it could instead take place in nginx, which has very good caching built in, provided we stick to HTTP caching semantics.
References:
- Infrastructure call minutes: https://docs.google.com/document/d/1B7pyJTv6HKPs5bBWWAjIYUKCb1i_UU3MQwh4D4e71vk/edit#heading=h.26c8go6r6hs
- Outage: https://gitlab.com/gitlab-com/infrastructure/issues/4397
- Outage details (confidential): https://docs.google.com/document/d/1xURna96DlBXbaQhgTqQTSW8VBWnR55If9TTbukGC0B0/edit#bookmark=id.5udrd1hpsz0f
- https://gitlab.com/gitlab-org/gitlab-ce/issues/48234
## Proposed Caching Solution
In https://gitlab.com/gitlab-org/gitlab-ce/issues/26926, we built a mechanism to short-circuit the generation of a full HTTP response when the client issues a conditional HTTP request. This is very effective at alleviating load from polling clients when the data is unlikely to change frequently.
This implementation has worked well for over a year.
The only "problem" with the current implementation is that a client only benefits once it has already loaded the response. If a second client makes the same request, the response needs to be regenerated at full expense (to Postgres, Gitaly, etc.).
I would like to investigate adding caching to our nginx configuration, specifically using the `proxy_cache_revalidate` option: http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_revalidate
Using this approach, nginx can "convert" a non-conditional HTTP request (from a new client) into a conditional HTTP request to the backend. If nginx has a cached copy of the content, it can issue a conditional request to ensure that 1) the client is allowed to access the data, and 2) the data is still current. If the server responds with `304 Not Modified`, nginx can "convert" the response back into a full, non-conditional `200 OK` response using the data in its cache.
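For illustration, a minimal nginx sketch of this might look like the following. The cache zone name, paths, sizes, validity period, location match, and upstream name are all hypothetical, not our actual production configuration:

```nginx
# Hypothetical sketch only -- zone name, paths, sizes, and upstream
# are illustrative, not production values.
proxy_cache_path /var/cache/nginx/raw levels=1:2 keys_zone=raw_cache:10m
                 max_size=1g inactive=60m use_temp_path=off;

server {
    listen 80;

    # Hypothetical location; in practice this would match raw blob paths.
    location /raw/ {
        proxy_cache raw_cache;

        # Hypothetical short validity: entries older than this are
        # revalidated rather than unconditionally refetched.
        proxy_cache_valid 200 10s;

        # When a cached entry expires, nginx revalidates it with a
        # conditional request (If-Modified-Since / If-None-Match)
        # instead of a full fetch; a 304 from the backend refreshes
        # the entry and nginx serves the cached body to the client.
        proxy_cache_revalidate on;

        # Hypothetical upstream name; in our deployment this would
        # be gitlab-workhorse.
        proxy_pass http://gitlab_backend;
    }
}
```

Note that with `proxy_cache_revalidate on`, nginx only issues the conditional request once a cached entry has expired; the cache validity period therefore controls how often the backend is consulted.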
This means our existing conditional ETag approach can benefit multiple clients, including those that don't yet have cached copies (or that use clients which don't issue conditional requests, such as `curl`).
One downside to this approach is that the cached content cannot be shared across the nginx fleet, so each nginx server will maintain its own cache, lowering the hit rate. There are two further steps we could take to improve this: 1) run GitLab.com behind a CDN, which can issue the same conditional requests as nginx, or 2) upgrade to NGINX Plus, which supports federated caching.
This approach (using `proxy_cache_revalidate`) was employed successfully in the Gitter avatar service, where it substantially reduced traffic to the backend: https://gitlab.com/gl-infra/gitter-infrastructure/blob/master/ansible/roles/gitter/avatars/templates/nginx-conf.j2
cc @jarv (for raising the caching question)
cc @northrup (for CDN knowledge)
@sytses was concerned about the testability of this approach, but since we are sticking to standard HTTP caching semantics, testing should not be difficult; the semantics have already been proven through the existing ETag support.