Infradev: Improve job duration of urgent hourly jobs
Summary
Over the weekend, we received a number of Sidekiq execution apdex drop alerts for the urgent-other shard, firing at approximately 13 minutes past the hour for several hours.
This led me to declare an incident: gitlab-com/gl-infra/production#17084 (closed)
We'd like your team's help working out how to reduce the execution time of these jobs. Alternatively, we may need to adjust the target duration or urgency of the scheduled CI jobs, or the SLO.
Impact
- 17 pages in the last 2 weeks, causing significant toil.
- Scheduled CI jobs are taking longer to complete, but because they are scheduled jobs, this is unlikely to have a noticeable impact on users.
This also impacts the error budget for pipeline execution. At the time of writing, this was the biggest contributor to the budget spend:
The workers causing these are visible on the SLI detail dashboard:
The workers that seem to violate their target duration on this shard the most often are:
- PipelineProcessWorker
- PipelineMetricsWorker
- Ci::CancelRedundantPipelinesWorker
- Ci::InitialPipelineProcessWorker
Recommendation
Investigate what is needed to improve the execution time of these hourly jobs, or adjust their parameters, to avoid these sharp hourly apdex drops. The drops normally don't alert, but eventually get bad enough to page for a few hours.
One way we could improve these workers is to split them up: perform only what is strictly necessary on the urgent queue, and delegate the rest to a job with lower urgency.
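To make the splitting idea concrete, here is a minimal plain-Ruby sketch. It does not use Sidekiq or any GitLab code: `QUEUES`, `UrgentPipelineStep`, `DeferredPipelineStep`, `enqueue`, `drain`, and `run_split_demo` are all hypothetical names, and the "work" is only a log line. It illustrates the shape of the change and is not a proposal for how the actual workers should be divided.

```ruby
# Plain-Ruby sketch of splitting one worker into an urgent and a deferred part.
# Every name here is hypothetical; in-memory arrays stand in for Sidekiq queues.
Job = Struct.new(:worker, :pipeline_id)

QUEUES = { urgent: [], low: [] }

# Push a job onto the queue for the given urgency.
def enqueue(urgency, worker, pipeline_id)
  QUEUES.fetch(urgency) << Job.new(worker, pipeline_id)
end

# Lower-urgency part: work that can tolerate a longer target duration.
class DeferredPipelineStep
  def perform(pipeline_id, log)
    log << "low: follow-up work for pipeline #{pipeline_id}"
  end
end

# Urgent part: does only what is strictly necessary, then hands off the rest.
class UrgentPipelineStep
  def perform(pipeline_id, log)
    log << "urgent: essential work for pipeline #{pipeline_id}"
    enqueue(:low, DeferredPipelineStep, pipeline_id)
  end
end

# Run every job currently waiting on one queue, in order.
def drain(urgency, log)
  queue = QUEUES.fetch(urgency)
  until queue.empty?
    job = queue.shift
    job.worker.new.perform(job.pipeline_id, log)
  end
end

# Enqueue one urgent job, then process the urgent queue before the low one.
def run_split_demo(pipeline_id)
  log = []
  enqueue(:urgent, UrgentPipelineStep, pipeline_id)
  drain(:urgent, log)
  drain(:low, log)
  log
end

puts run_split_demo(42)
```

With this shape, the urgent job's duration covers only the essential step, so the follow-up work would be measured against the lower-urgency target instead.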
Verification
Ensure we reduce the noise so that these alerts fire only for problems that need attention. The alerts seem to fire more on weekends than during regular busy periods.
Ideally we'd flatten out the hourly drops in apdex visible on the error budget detail dashboard:
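As a rough way to check for the hourly pattern outside the dashboard, the sketch below groups job durations by minute of the hour and reports the minutes whose apdex falls below a threshold. It assumes a simplified apdex (jobs within the target duration divided by all jobs). The sample durations, the 10-second target, and the 0.9 threshold are made-up placeholders, not the real configuration of this shard.

```ruby
# Simplified apdex: fraction of jobs that finished within the target duration.
# Returns nil when there are no jobs to score.
def apdex(durations, target_seconds)
  return nil if durations.empty?

  satisfied = durations.count { |d| d <= target_seconds }
  satisfied.to_f / durations.size
end

# samples is an array of [minute_of_hour, duration_seconds] pairs.
# Returns a hash of minute => apdex for that minute.
def apdex_by_minute(samples, target_seconds)
  samples
    .group_by { |minute, _duration| minute }
    .transform_values { |rows| apdex(rows.map { |_minute, d| d }, target_seconds) }
end

# Minutes of the hour whose apdex is below the given threshold, sorted.
def violating_minutes(samples, target_seconds, threshold)
  apdex_by_minute(samples, target_seconds)
    .select { |_minute, score| score < threshold }
    .keys
    .sort
end

# Made-up sample data: most jobs are fast, jobs around minute 13 are slow.
samples = [
  [5, 1.2], [5, 3.0],
  [13, 14.0], [13, 9.0], [13, 22.5],
  [40, 0.8]
]

p violating_minutes(samples, 10, 0.9)
```

If the hourly drops were flattened, a check like this run over real job durations would return an empty list, or at least stop singling out the same minute every hour.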