Make it possible to define custom request target duration (!69877) · Merge requests · GitLab.org / GitLab

Quang-Minh Nguyen requested to merge qmnguyen0711/1223-make-it-possible-to-define-custom-request-duration-thresholds-in-the-rails-application into master Sep 08, 2021

What does this MR do and why?

For gitlab-com/gl-infra/scalability#1223 (closed)

We are applying feature categorization system to classify Web/Api/Worker components into different feature categories. These categories are then used in many places, especially error budget. After the error budget system is released, we realized that the recent one-for-all latency thresholds (originally 1 second, but then adjusted to 5 seconds, are not perfect for everyone across the companies. Some endpoints are slow by nature; others are critical to our system, hence need to be conservatively fast. This MR is to make it possible to customize the target duration for each endpoints.

One good way of doing this is to extend the recent feature_category DSL to include action options. It's fortunate for us that the recent DSL signature declares the actions as an array, instead of spreading out: def feature_category(category, actions = []. We can easily add the action options as a keyword argument. I think the new interface looks like this:

      target_duration :slow 
      target_duration :fast
      target_duration :slow, [:pipeline_status, :exposed_artifacts]

      # Grape API, declare at API handler level
      target_duration :fast, [
        '/users/:id/custom_attributes', 
        '/users/:id/custom_attributes/:key'
      ]

      # Grape API, declare at handler level
      post target_duration: :very_fast do
        # Blah blah
      end

The target duration option consists of 4 options. The default option is medium

Name	Max duration	Cumulative traffic share	Thoughts
`very_fast`	0.25s	95.99%	I think this threhsold would be handy to have for super busy endpoints (`/jwt/auth` or `Repositories::GitHttpController#info_refs`)
`fast`	0.5s	98.39%	So we can perhaps start striving for this to be the new default
`medium`	1s	99.62%	This is the default for most requests
`slow`	5s	99.99%	For the known slow endpoints

Using abbreviation instead of numbers is more friendly to engineers, and removes the paradox of choices. In fact, reasoning and debating which option to choose from are already time-consuming enough.

To achieve the goals, this MR:

Add new target_duration DSL in Gitlab::WithFeatureCategory to support options.
Provide some helper methods to access target durations in Controllers and APIs.
Provide a way to retrieve the target duration inside middleware layer, Gitlab::Metrics::RequestsRackMiddleware in detail. Technically, this should be a part of gitlab-com/gl-infra/scalability#1099 (closed), but it is nice to provide this method and corresponding tests beforehand.

Screenshots or screen recordings

No. There shouldn't be any changes from the user perspective.

How to set up and validate locally

Update a controller using the above DSL, for example UsersController

class UsersController < ApplicationController
  target_duration :fast, [:show]
  target_duration :slow, [:groups, :snippets]
end

Check the feature categories and target duration in the terminal:

[2] pry(main)> c = UsersController.new.tap { |c| c.action_name = :show }
=> #<UsersController:0x00000000044520>
[3] pry(main)> c.feature_category
=> "users"
[4] pry(main)> c.target_duration
=> :fast
[5] pry(main)> c = UsersController.new.tap { |c| c.action_name = :snippets }
=> #<UsersController:0x00000000044548>
[6] pry(main)> c.feature_category
=> "users"
[7] pry(main)> c.target_duration
=> :slow
5] pry(main)> c = UsersController.new.tap { |c| c.action_name = :snippets }
=> #<UsersController:0x00000000044548>
[6] pry(main)> c.feature_category
=> "users"
[7] pry(main)> c.target_duration
=> :slow
[10] pry(main)> c = UsersController.new.tap { |c| c.action_name = :ssh_keys }
=> #<UsersController:0x00000000044598>
[11] pry(main)> c.feature_category
=> ""
[12] pry(main)> c.target_duration
=> :medium

MR acceptance checklist

These checklists encourage us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Quality

Confirmed

I have self-reviewed this MR per code review guidelines.
For the code that that this change impacts, I believe that the automated tests (Testing Guide) validate functionality that is highly important to users (including consideration of all test levels). If the existing automated tests do not cover this functionality, I have added the necessary additional tests or I have added an issue to describe the automation testing gap and linked it to this MR.
I have considered the technical aspects of the impact of this change on both gitlab.com hosted customers and self-hosted customers.
I have considered the impact of this change on the front-end, back-end, and database portions of the system where appropriate and applied frontend, backend and database labels accordingly.
I have tested this MR in all supported browsers, or determiend that this testing is not needed.
I have confirmed that this change is backwards compatible across updates, or I have decided that this does not apply.
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
If I am introducing a new expectation for existing data, I have confirmed that existing data meets this expectation or I have made this expectation optional rather than required.

Performance, reliability and availability

Confirmed

I am confident that this MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
I have added information for database reviewers in the MR description, or I have decided that it is unnecessary. (Does this MR have database-related changes?)
I have considered the availability and reliability risks of this change. I have also considered the scalability risk based on future predicted growth
I have considered the performance, reliability and availability impacts of this change on large customers who may have significantly more data than the average customer.

Documentation

Confirmed - The document will be fulfilled in gitlab-com/gl-infra/scalability#1237 (closed), after we finished the infrastructure change and collect examples

I have included changelog trailers, or I have decided that they are not needed. (Does this MR need a changelog?)
I have added/updated documentation, or I have decided that documentation changes are not needed for this MR. (Is documentation required?)

Security

Confirmed

I have confirmed that if this MR contains changes to processing or storing of credentials or tokens, authorization, and authentication methods, or other items described in the security review guidelines, I have added the label security and I have @-mentioned @gitlab-com/gl-security/appsec.

Deployment

Confirmed

I have considered using a feature flag for this change because the change may be high risk. If I decided to use a feature flag, I plan to test the change in staging before I test it in production, and I have considered rolling it out to a subset of production customers before doing rolling it out to all customers. When to use a feature flag
I have informed the Infrastructure department of a default setting or new setting change per definition of done, or decided that this is not needed.

Edited Sep 14, 2021 by Quang-Minh Nguyen

Make it possible to define custom request target duration