Skip to content

Make it possible to define custom request target duration

What does this MR do and why?

For gitlab-com/gl-infra/scalability#1223 (closed)

We are applying feature categorization system to classify Web/Api/Worker components into different feature categories. These categories are then used in many places, especially error budget. After the error budget system is released, we realized that the recent one-for-all latency thresholds (originally 1 second, but then adjusted to 5 seconds, are not perfect for everyone across the companies. Some endpoints are slow by nature; others are critical to our system, hence need to be conservatively fast. This MR is to make it possible to customize the target duration for each endpoints.

One good way of doing this is to extend the recent feature_category DSL to include action options. It's fortunate for us that the recent DSL signature declares the actions as an array, instead of spreading out: def feature_category(category, actions = []. We can easily add the action options as a keyword argument. I think the new interface looks like this:

      target_duration :slow 
      target_duration :fast
      target_duration :slow, [:pipeline_status, :exposed_artifacts]

      # Grape API, declare at API handler level
      target_duration :fast, [
        '/users/:id/custom_attributes', 
        '/users/:id/custom_attributes/:key'
      ]

      # Grape API, declare at handler level
      post target_duration: :very_fast do
        # Blah blah
      end

The target duration option consists of 4 options. The default option is medium

Name Max duration Cumulative traffic share Thoughts
very_fast 0.25s 95.99% I think this threhsold would be handy to have for super busy endpoints (/jwt/auth or Repositories::GitHttpController#info_refs)
fast 0.5s 98.39% So we can perhaps start striving for this to be the new default
medium 1s 99.62% This is the default for most requests
slow 5s 99.99% For the known slow endpoints

Using abbreviation instead of numbers is more friendly to engineers, and removes the paradox of choices. In fact, reasoning and debating which option to choose from are already time-consuming enough.

To achieve the goals, this MR:

Screenshots or screen recordings

No. There shouldn't be any changes from the user perspective.

How to set up and validate locally

  • Update a controller using the above DSL, for example UsersController
class UsersController < ApplicationController
  target_duration :fast, [:show]
  target_duration :slow, [:groups, :snippets]
end
  • Check the feature categories and target duration in the terminal:
[2] pry(main)> c = UsersController.new.tap { |c| c.action_name = :show }
=> #<UsersController:0x00000000044520>
[3] pry(main)> c.feature_category
=> "users"
[4] pry(main)> c.target_duration
=> :fast
[5] pry(main)> c = UsersController.new.tap { |c| c.action_name = :snippets }
=> #<UsersController:0x00000000044548>
[6] pry(main)> c.feature_category
=> "users"
[7] pry(main)> c.target_duration
=> :slow
5] pry(main)> c = UsersController.new.tap { |c| c.action_name = :snippets }
=> #<UsersController:0x00000000044548>
[6] pry(main)> c.feature_category
=> "users"
[7] pry(main)> c.target_duration
=> :slow
[10] pry(main)> c = UsersController.new.tap { |c| c.action_name = :ssh_keys }
=> #<UsersController:0x00000000044598>
[11] pry(main)> c.feature_category
=> ""
[12] pry(main)> c.target_duration
=> :medium

MR acceptance checklist

These checklists encourage us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Quality

  • Confirmed
  1. I have self-reviewed this MR per code review guidelines.
  2. For the code that that this change impacts, I believe that the automated tests (Testing Guide) validate functionality that is highly important to users (including consideration of all test levels). If the existing automated tests do not cover this functionality, I have added the necessary additional tests or I have added an issue to describe the automation testing gap and linked it to this MR.
  3. I have considered the technical aspects of the impact of this change on both gitlab.com hosted customers and self-hosted customers.
  4. I have considered the impact of this change on the front-end, back-end, and database portions of the system where appropriate and applied frontend, backend and database labels accordingly.
  5. I have tested this MR in all supported browsers, or determiend that this testing is not needed.
  6. I have confirmed that this change is backwards compatible across updates, or I have decided that this does not apply.
  7. I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
  8. If I am introducing a new expectation for existing data, I have confirmed that existing data meets this expectation or I have made this expectation optional rather than required.

Performance, reliability and availability

  • Confirmed
  1. I am confident that this MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
  2. I have added information for database reviewers in the MR description, or I have decided that it is unnecessary. (Does this MR have database-related changes?)
  3. I have considered the availability and reliability risks of this change. I have also considered the scalability risk based on future predicted growth
  4. I have considered the performance, reliability and availability impacts of this change on large customers who may have significantly more data than the average customer.

Documentation

  1. I have included changelog trailers, or I have decided that they are not needed. (Does this MR need a changelog?)
  2. I have added/updated documentation, or I have decided that documentation changes are not needed for this MR. (Is documentation required?)

Security

  • Confirmed
  1. I have confirmed that if this MR contains changes to processing or storing of credentials or tokens, authorization, and authentication methods, or other items described in the security review guidelines, I have added the label security and I have @-mentioned @gitlab-com/gl-security/appsec.

Deployment

  • Confirmed
  1. I have considered using a feature flag for this change because the change may be high risk. If I decided to use a feature flag, I plan to test the change in staging before I test it in production, and I have considered rolling it out to a subset of production customers before doing rolling it out to all customers. When to use a feature flag
  2. I have informed the Infrastructure department of a default setting or new setting change per definition of done, or decided that this is not needed.
Edited by Quang-Minh Nguyen

Merge request reports

Loading