Root Cause Analysis - Pipelines failing due to missing project includes
Summary
- feature flag rollout - #392746 (closed)
- Additional details can be found in #382751 (closed)
Service(s) affected: GitLab.com
Team attribution: group::pipeline authoring
Minutes of downtime or degradation: 149 minutes of degradation between the feature flag rollout and the first customer report (12:11 UTC - 14:40 UTC)
Impact & Metrics
| Question | Answer |
|---|---|
| What was the impact? | Pipelines using project includes would fail because they attempted to fetch an included file from a project (or revision) that did not exist. |
| Who was impacted? | Any customers using project include references (see the example below) |
| How did this impact customers? | Pipelines would fail with a message saying our CI template project (or tag) does not exist. |
| How many attempts were made to access? | N/A |
| How many customers were affected? | Unknown - any customers using project include references. 3 reports (https://gitlab.zendesk.com/agent/tickets/383278, https://gitlab.zendesk.com/agent/tickets/383292, https://gitlab.zendesk.com/agent/tickets/383259) came through notifying us of the problem |
| How many customers tried to access? | Unknown |
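For context, the affected feature lets one project's `.gitlab-ci.yml` pull CI configuration from another project. A minimal example of such a project include is shown below; the project path, ref, and file are hypothetical placeholders, not the configuration of any affected customer.

```yaml
include:
  - project: 'my-group/ci-templates'        # hypothetical project path
    ref: 'v1.2.3'                           # hypothetical tag
    file: '/templates/build.gitlab-ci.yml'  # hypothetical file path
```

During the incident window, pipelines using this kind of include could fail with a message that the referenced project (or tag) does not exist, even when it did.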
Detection & Response
| Question | Answer |
|---|---|
| When was the incident detected? | The initial report of the problem was received on 2023-03-08 14:40 UTC in Zendesk |
| How was the incident detected? | Reported to Support, who triaged the issue; a support engineer then contacted @furkanayhan |
| Did alarming work as expected? | N/A |
| How long did it take from the start of the incident to its detection? | 2 hours 29 minutes (2023-03-08 12:11 UTC -> 2023-03-08 14:40 UTC, when first reported by a customer; other SaaS customers may have hit the problem earlier without reporting it) |
| How long did it take from detection to remediation? | 41 minutes - the feature flag was disabled at 2023-03-08 15:21 UTC |
| What steps were taken to remediate? | See the timeline below |
| Were there any issues with the response? | A corrective action issue has been created, and the associated MR will improve test coverage to help prevent future incidents. |
MR Checklist
| Question | Answer |
|---|---|
| Was the MR acceptance checklist marked as reviewed in the MR? | Yes - the acceptance checklists of all related MRs were marked as reviewed |
| Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? | |
Timeline
NOTE: all times in UTC.
- 2023-03-08 12:11 - feature flag rollout begins in #392746 (comment 1305426467)
- 2023-03-08 14:40 - Customer informs Support that their pipelines are failing with a message saying that our CI template project (or tag) does not exist.
- 2023-03-08 15:21 - feature flag disabled and investigation on GitLab side begins
- 2023-03-08 20:44 - corrective action issue created to improve test coverage
- 2023-03-10 14:19 - MR created to add a new end-to-end test to improve test coverage.
- 2023-03-10 14:28 - Initial and Maintainer review requests initiated.
- 2023-03-13 11:36 - Initial MR review completed ✅
Root Cause Analysis
The purpose of this document is to understand the reasons that caused an incident and to create mechanisms to prevent it from recurring. A root cause can never be a person; the write-up has to refer to the system and the context rather than the specific actors.
Follow the "5 whys" in a blameless manner as the core of the root cause analysis.
Start with the incident and question why it happened. Keep iterating, asking "why?" 5 times. While it's not a hard rule that it has to be 5 times, it helps the questions dig deeper toward the actual root cause.
Keep in mind that one "why?" may have more than one answer; consider following the different branches.
"5 whys"
- Pipelines are failing, saying that our CI template project (or tag) does not exist. Why?
  - A code change was deployed as part of a feature flag rollout. The bug was that we did not use `BatchLoader` correctly: we were using the same `sha` for every project file fetched within a single CI config (a hypothetical sketch of this bug class follows this list).
- Why did this bug not get noticed in staging?
  - Our tests did not cover the bug because we always use the same repository for projects (`create(:project, :repository)`), so a fetch against the wrong project or revision can still succeed in tests.
- Why is an integration test for this use case missing?
  - The end-to-end tests were always using the same repository for projects, so the gap was never exposed; a corrective action issue was created to improve our test coverage.
- Why was this not caught in the review of the original MR?
  - Reviewers were not familiar with the specific changes that would cause this bug.
- Why did it take over 2 hours to resolve this issue in production?
  - Support tickets did not come in immediately, which suggested that the initial feature flag rollout to 1% of actors was working as expected. Once tickets were received and triage began, the feature flag was immediately disabled for further investigation, at roughly the 2-hour mark.
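The following is a minimal, hypothetical sketch of this class of bug, not GitLab's actual implementation. It assumes the `batch-loader` gem's `BatchLoader.for(...).batch` API; the `BLOBS` store, key layout, and method names are invented for illustration.

```ruby
require 'batch_loader'

# Pretend blob store: { [project_id, sha, path] => file contents }
BLOBS = {
  [1, 'aaa111', 'templates/build.yml']  => 'build template',
  [2, 'bbb222', 'templates/deploy.yml'] => 'deploy template',
}.freeze

# Buggy pattern: the sha is captured from the surrounding scope rather than
# being part of the batch key, so a single sha can end up being applied to
# every project file fetched in the same batch.
def fetch_blob_buggy(project_id, path, sha)
  BatchLoader.for([project_id, path]).batch do |keys, loader|
    keys.each do |(pid, blob_path)|
      # Lookup can miss for other projects because `sha` belongs to one project only.
      loader.call([pid, blob_path], BLOBS[[pid, sha, blob_path]])
    end
  end
end

# Safer pattern: the sha is part of the batch key, so each included file is
# resolved against its own project's revision.
def fetch_blob_fixed(project_id, path, sha)
  BatchLoader.for([project_id, sha, path]).batch do |keys, loader|
    keys.each do |(pid, blob_sha, blob_path)|
      loader.call([pid, blob_sha, blob_path], BLOBS[[pid, blob_sha, blob_path]])
    end
  end
end
```

In this sketch the remedy is simply to make the revision part of what identifies a batched lookup, so includes from different projects cannot bleed into each other; the actual GitLab fix may differ in detail.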
What went well
- Things that worked well or as expected:
  - By releasing this code behind a feature flag, once the problem was reported we were able to immediately disable the flag and prevent any further confusing messaging to users (see the sketch after this list).
  - It was rolled out to only 1% of actors, so a small subset of customers was affected.
  - Between the feature flag being rolled out and it being disabled, roughly 2 hours was a relatively quick turnaround to be notified, begin triaging, and prevent an incident from having to be reported via PagerDuty.
- Additional call-outs for what went particularly well:
  - @furkanayhan was able to jump in immediately to disable the feature flag and begin investigating.
  - The coordination between Support and Engineering was exceptional (about 40 minutes between the first reports being received and finding the appropriate group, with the right team members available to disable the feature flag).
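As a rough illustration of why the feature flag limited the blast radius: the new code path is only taken when the flag is enabled for the current actor, so disabling the flag immediately sends all traffic back to the old behavior. This is a hypothetical sketch using GitLab's `Feature.enabled?` helper; the flag name and surrounding methods are invented, not the actual rollout code.

```ruby
# Hypothetical sketch -- flag name and helper methods are invented.
def fetch_project_include(project, path, sha)
  if Feature.enabled?(:batch_project_ci_includes, project)
    # New, batched code path rolled out to 1% of actors; this is where the bug lived.
    fetch_blob_batched(project, path, sha)
  else
    # Old code path; disabling the flag instantly reverts everyone to this branch.
    fetch_blob_legacy(project, path, sha)
  end
end
```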
What can be improved
- Is there anything that could have been done to improve the detection or time to detection?
  - Having better test coverage, which is what #395699 (closed) will add.
- Is there anything that could have been done to improve the response or time to response?
  - Response time was efficient between the problem first being reported and Engineering assisting.
- Is there an existing issue that would have either prevented this incident or reduced the impact?
  - Having additional E2E testing would have prevented users from receiving confusing messages when their pipelines failed.
- Did we have any indication or beforehand knowledge that this incident might take place?
  - No comments were raised in the original approved MR to indicate any concerns.
- Was the MR acceptance checklist marked as reviewed in the MR?
  - Yes - the original MR's checklist was approved by the maintainer.
- Should the checklist be updated to help reduce chances of future recurrences?
  - Checklist improvements would not have caught this issue; ensuring quality test coverage is in place for the future will, which #395699 (closed) addresses (see the test sketch below).
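A rough sketch of the kind of coverage the corrective action aims for, assuming the RSpec/FactoryBot conventions used in the GitLab codebase; the factory trait, file names, and structure are illustrative and not the actual test added in #395699.

```ruby
require 'spec_helper'

# Illustrative spec sketch -- not the actual test from the corrective action MR.
# The key idea: use two projects whose repositories contain *different* files,
# so fetching an include from the wrong project or revision cannot silently pass
# the way it can when every project shares the create(:project, :repository) fixture.
RSpec.describe 'cross-project CI includes' do
  let(:template_project) do
    # Assumed factory trait that seeds a repository with custom files.
    create(:project, :custom_repo, files: { 'templates/build.yml' => "build:\n  script: [echo build]\n" })
  end

  let(:other_project) do
    create(:project, :custom_repo, files: { 'templates/deploy.yml' => "deploy:\n  script: [echo deploy]\n" })
  end

  it 'fetches each include from its own project and ref' do
    # Assertion left as a sketch: resolving the include for template_project
    # must return templates/build.yml, never a file (or sha) from other_project.
  end
end
```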
Corrective actions
- Issues: #382751 (closed) (main issue), #392746 (closed) (feature flag rollout), #395699 (closed) (corrective action)
- Estimated date of completion of the corrective action: 4 business days between the MR being created and merged.
- DRI to deliver corrective action: @furkanayhan