Draft: Enforce CI minutes quota for running jobs (!57731) · Merge requests · GitLab.org / GitLab

Fabio Pitino requested to merge fp-enforce-minutes-quota-for-running-jobs into master Mar 29, 2021

What does this MR do?

This MR introduces monitoring and enforcement of CI minutes usage for running builds.

We accumulate CI minutes consumption into namespace_statistics.shared_runners_seconds only when builds complete. Pipelines with very long-running builds can therefore push the CI minutes consumption far beyond the limit set on the root namespace.

To limit this, we need to monitor the CI minutes consumption of running builds and enforce the limit by dropping builds once the limit is exceeded by more than a 20.minutes grace period.
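The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: the method name, its parameters and the constant are hypothetical, and it assumes that "exceeded by the grace period" means the projected consumption is greater than the limit plus 20 minutes. Cost factors and other details are ignored.

```ruby
# Hypothetical sketch: names and constants are illustrative, not taken from the MR.
GRACE_PERIOD_SECONDS = 20 * 60 # the 20-minute grace period described above

# completed_seconds: consumption already accumulated from finished builds
# running_seconds:   elapsed seconds of each build still running on shared runners
# limit_seconds:     CI minutes quota of the root namespace, expressed in seconds
def quota_exceeded_with_grace?(completed_seconds:, running_seconds:, limit_seconds:)
  projected = completed_seconds + running_seconds.sum
  projected > limit_seconds + GRACE_PERIOD_SECONDS
end

# A namespace with a 400-minute limit (24_000 seconds):
within_grace = quota_exceeded_with_grace?(
  completed_seconds: 23_000, running_seconds: [600, 900], limit_seconds: 24_000
)
past_grace = quota_exceeded_with_grace?(
  completed_seconds: 23_000, running_seconds: [1_500, 900], limit_seconds: 24_000
)
```

In the first call the projected total (24_500 seconds) is over the limit but still inside the grace period, so nothing would be dropped. In the second call the total (25_400 seconds) is past the limit plus grace, so running builds would become candidates for dropping.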

Because this operation is expensive (it runs at the root namespace level), we have a few layers of checks to avoid unnecessary processing:

1. Allow the check to be scheduled at most once every 5 minutes per project.
2. Allow the check to run exclusively, at most once every 5 minutes at the namespace level (out of the multiple schedules from 1. above, only one actually runs).
3. Skip processing if not on GitLab.com.
4. Skip processing if the project does not have shared runners enabled or is public.
5. Skip processing if the project is on any paid plan (no trial).
6. Only consider cancelable builds in recent cancelable pipelines.
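
The cheap eligibility checks in the list above (not on GitLab.com, shared runners disabled, public project, paid plan) could be expressed as early returns, so that the expensive namespace-level queries only run when every guard passes. The sketch below is illustrative only: ProjectInfo, its fields and skip_enforcement? are hypothetical stand-ins, not the models or methods used in this MR, and the scheduling and exclusive-run throttles are left out.

```ruby
# Hypothetical stand-in for a project; the fields are assumptions for illustration.
ProjectInfo = Struct.new(:shared_runners_enabled, :public, :paid_plan, keyword_init: true)

# Returns true when the expensive quota check should be skipped entirely.
def skip_enforcement?(project, on_gitlab_com:)
  return true unless on_gitlab_com                  # only enforced on GitLab.com
  return true unless project.shared_runners_enabled # no shared runners, nothing to meter
  return true if project.public                     # public projects are skipped
  return true if project.paid_plan                  # paid plans are skipped

  false
end

free_private = ProjectInfo.new(shared_runners_enabled: true, public: false, paid_plan: false)
paid_private = ProjectInfo.new(shared_runners_enabled: true, public: false, paid_plan: true)

skip_enforcement?(free_private, on_gitlab_com: true)  # enforcement proceeds
skip_enforcement?(paid_private, on_gitlab_com: true)  # skipped
```

Ordering the guards from cheapest to most expensive keeps the common "nothing to do" case fast.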

TODO:

add remaining specs
add feature flag
do E2E manual QA

Query plans

Builds in namespace being run by shared runners

Executes a query per batch of builds.

::Ci::Build
  .running
  .from_shared_runners
  .for_project(root_namespace.all_projects)
  .updated_after(RUNNING_BUILDS_SINCE_TIME.ago)
  .each_batch { ... }

https://console.postgres.ai/gitlab/gitlab-production-tunnel/sessions/3192/commands/10427

Time: 24.021 ms
  - planning: 23.447 ms
  - execution: 0.574 ms
    - I/O read: N/A
    - I/O write: N/A

Shared buffers:
  - hits: 4 (~32.00 KiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0
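
As a rough picture of why the iteration above issues one query per batch, here is a plain-Ruby analogy of walking rows in batches ordered by id. It is not GitLab's each_batch implementation: each_id_batch and the in-memory rows array are hypothetical, and the sketch assumes the relation is traversed in primary-key order, with each loop pass standing in for one bounded query.

```ruby
# Plain-Ruby analogy of batched iteration; not GitLab's each_batch implementation.
# rows is an in-memory stand-in for a table; each loop pass models one query.
def each_id_batch(rows, of:)
  last_id = 0
  loop do
    # Models: WHERE id > last_id ORDER BY id LIMIT of
    batch = rows.select { |row| row[:id] > last_id }
                .sort_by { |row| row[:id] }
                .first(of)
    break if batch.empty?

    yield batch
    last_id = batch.last[:id]
  end
end

builds = [{ id: 5 }, { id: 1 }, { id: 3 }, { id: 2 }]
seen_batches = []
each_id_batch(builds, of: 2) { |batch| seen_batches << batch.map { |row| row[:id] } }
```

With a batch size of 2, the four rows are visited as two batches, so the cost of each individual query stays bounded regardless of how many builds the namespace has.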

All projects in namespace

root_namespace.all_projects.find_each { ... }

https://console.postgres.ai/gitlab/gitlab-production-tunnel/sessions/3246/commands/10618

Time: 20.115 ms
  - planning: 3.249 ms
  - execution: 16.866 ms
    - I/O read: N/A
    - I/O write: N/A

Shared buffers:
  - hits: 4097 (~32.00 MiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

Online specific runners that can be used by a project

Executed for each project in the namespace that has builds to drop.

Ci::Runner.specific_for_project(project).with_tags.online.to_a

Time: 34.966 ms
  - planning: 5.353 ms
  - execution: 29.613 ms
    - I/O read: 28.266 ms
    - I/O write: N/A

Shared buffers:
  - hits: 61 (~488.00 KiB) from the buffer pool
  - reads: 9 (~72.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 1 (~8.00 KiB)
  - writes: 0

https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/3246/commands/10619

Recent cancelable pipelines for project

::Ci::Pipeline
  .for_project(project)
  .cancelable
  .updated_after(ALIVE_BUILDS_SINCE_TIME.ago)
  .each_batch(of: 100) { ... }

https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/3246/commands/10621

Time: 0.961 ms
  - planning: 0.601 ms
  - execution: 0.360 ms
    - I/O read: N/A
    - I/O write: N/A

Shared buffers:
  - hits: 62 (~496.00 KiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

Recent alive builds in pipelines

::Ci::Build.in_pipelines(pipelines)
  .running_or_pending_or_created
  .updated_after(ALIVE_BUILDS_SINCE_TIME.ago)
  .in_batches(of: 150) { ... }

For the query plan, I used in_batches(of: 3) to test the first batch.

https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/3246/commands/10622

Time: 85.662 ms
  - planning: 19.410 ms
  - execution: 66.252 ms
    - I/O read: 65.670 ms
    - I/O write: N/A

Shared buffers:
  - hits: 22 (~176.00 KiB) from the buffer pool
  - reads: 10 (~80.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 1 (~8.00 KiB)
  - writes: 0

Does this MR meet the acceptance criteria?

Conformity

📋 Does this MR need a changelog?
- I have included a changelog entry.
- I have not included a changelog entry because _____.
Documentation (if required)
Code review guidelines
Merge request performance guidelines
Style guides
Database guides
Separation of EE specific content

Availability and Testing

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
Tested in all supported browsers
Informed Infrastructure department of a default or new setting change, if applicable per definition of done

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

Label as security and @ mention @gitlab-com/gl-security/appsec
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
Security reports checked/validated by a reviewer from the AppSec team

Edited Apr 19, 2021 by Fabio Pitino
