Infradev: Improve job duration of urgent hourly jobs

Summary

Over the weekend, we received a number of Sidekiq execution apdex drop alerts for the urgent-other shard (at approximately 13 minutes past the hour, for several hours), which led me to declare an incident: gitlab-com/gl-infra/production#17084 (closed)

We'd like your team's help to figure out how we can reduce execution time, or whether we should instead adjust the target duration and/or urgency of the scheduled CI jobs, and/or the SLO.

Impact

17 pages in the last 2 weeks, causing significant toil.
Scheduled CI jobs are taking longer to complete, but because they are scheduled jobs, this is unlikely to have any noticeable impact on users.

This also impacts the error budget for pipeline execution. At the time of writing, this was the biggest contributor to the budget spend:

[Image: error budget spend for pipeline execution (src)]

The workers causing these are visible on the SLI detail dashboard:

[Image: SLI detail dashboard showing the affected workers (src)]

The workers that seem to violate their target duration on this shard the most often are:

PipelineProcessWorker
PipelineMetricsWorker
Ci::CancelRedundantPipelinesWorker
Ci::InitialPipelineProcessWorker

Recommendation

Investigate what is needed to improve the execution time of these hourly jobs, or adjust parameters (target duration, urgency, or SLO) to avoid these sharp hourly apdex drops. The drops normally don't alert, but they eventually get bad enough to page for a few hours.
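
To make the trade-off concrete, here is a rough Python sketch (not GitLab code) of a simplified execution apdex: the share of jobs that finish within their target duration. The classic Apdex formula also counts "tolerating" samples at half weight; that is left out here. The durations and both target values are made-up numbers for illustration, not measurements from the urgent-other shard.

```python
from typing import Sequence


def execution_apdex(durations_s: Sequence[float], target_s: float) -> float:
    """Return the fraction of jobs whose execution time is within target_s."""
    if not durations_s:
        return 1.0  # no jobs ran, so nothing violated the target
    satisfied = sum(1 for d in durations_s if d <= target_s)
    return satisfied / len(durations_s)


# Hypothetical job durations in seconds (illustrative only).
quiet_period = [0.5, 1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 9.0]
# The same traffic plus a burst of slower jobs kicked off by hourly schedules.
hourly_burst = quiet_period + [12.0, 15.0, 20.0, 45.0]

URGENT_TARGET_S = 10.0   # assumed target for an urgent worker
RELAXED_TARGET_S = 60.0  # assumed looser target, for comparison

apdex_quiet = execution_apdex(quiet_period, URGENT_TARGET_S)
apdex_burst = execution_apdex(hourly_burst, URGENT_TARGET_S)
apdex_burst_relaxed = execution_apdex(hourly_burst, RELAXED_TARGET_S)

print(f"quiet, urgent target:  {apdex_quiet:.2f}")
print(f"burst, urgent target:  {apdex_burst:.2f}")
print(f"burst, relaxed target: {apdex_burst_relaxed:.2f}")
```

In this toy data the burst pulls the apdex down only under the tighter target, which is the shape of the hourly dips described above. Whether relaxing the target is acceptable depends on what these workers actually need to guarantee.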

One way to improve these workers is to split them up: perform only what is strictly necessary on the urgent queue, and delegate everything else to a job with lower urgency.
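
The split could look roughly like the following Python sketch. It is a language-neutral model of the idea, not the actual Ruby Sidekiq workers: the step names (`advance_pipeline`, `refresh_metrics`, `cleanup`) and the in-memory queue are hypothetical stand-ins, and which steps are truly "strictly necessary" would need to be decided per worker.

```python
from collections import deque

# Hypothetical stand-in for a lower-urgency Sidekiq queue.
low_urgency_queue: deque = deque()
# Records which work ran on which queue, in order.
log: list = []


def process_pipeline_urgent(pipeline_id: int) -> None:
    """Urgent job: run only the critical step, then hand off the rest."""
    # Hypothetical critical step that must stay within the urgent target.
    log.append(("urgent", "advance_pipeline", pipeline_id))
    # Hypothetical follow-up work, deferred instead of run inline.
    low_urgency_queue.append(("refresh_metrics", pipeline_id))
    low_urgency_queue.append(("cleanup", pipeline_id))


def drain_low_urgency() -> None:
    """Lower-urgency job: work through the deferred tasks in FIFO order."""
    while low_urgency_queue:
        task, pipeline_id = low_urgency_queue.popleft()
        log.append(("low", task, pipeline_id))


# Usage: two pipelines are handled quickly on the urgent path,
# and the deferred work runs afterwards on the low-urgency path.
for pid in (1, 2):
    process_pipeline_urgent(pid)
drain_low_urgency()

for entry in log:
    print(entry)
```

The urgent path stays short because it only enqueues the follow-up work, so its duration no longer includes the deferred steps. The cost is that the deferred steps complete later, which is only acceptable if nothing user-facing depends on them finishing immediately.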

Verification

Ensure we reduce the noise so that these alerts fire only for problems that need attention. These alerts seem to fire more on weekends than during regular busy periods.

Ideally we'd flatten out the hourly drops in apdex visible on the error budget detail dashboard:

[Image: error budget detail dashboard showing the hourly apdex drops (src)]

Edited Nov 09, 2023 by Gonzalo Servat