Pass pipeline with warnings when failed rspec job is retried successfully

Problem to solve

In an effort to reduce test flakiness, when an rspec job fails in the gitlab-org/gitlab project, these jobs are retried in a new rspec process as part of [Experiment] Retry failed specs in a new proces... (gitlab-org/quality/quality-engineering/team-tasks#1148 - closed). However, when these jobs are retried and successfully pass, no indication is given to the developer that the test in their merge request was flaky, so there's a good chance that the flaky test will be merged anyway. This exact situation happened as described in this comment.

In order to make flaky tests more visible and prevent them from being merged into master in the first place, we can take advantage of the allow_failure:exit_codes feature to exit with a specific error code when an rspec job fails and then passes after being retried. This indicates a flaky test which needs to be investigated further.

Proposal

We can implement the following behaviour to handle this:

If an rspec job succeeds, it exits with status 0.
If an rspec job fails, and the retried job fails as well, it exits with status 1.
If an rspec job fails, and the retried job succeeds, and the test that caused the failure belongs to the MR being tested, it exits with status 137 (for example).

This will cause the pipeline to "pass with warnings", indicating that the developer needs to investigate further before merging this MR.

Benefits

Less chance of flaky tests being merged into master.
Less time spent responding to flaky test alerts and fixing broken master builds.
Less chance of a broken-master fix implementing an incorrect test or other issues.

By making the flaky test visible at the merge request level, this makes it much easier for the original code author to fix, versus someone else who's unfamiliar with the code.

Implementation Plan

Determine which exit code to use for successfully retried tests, for example: 137.

A new variable SUCCESSFULLY_RETRIED_TEST_EXIT_CODE was introduced, and 137 was chosen for this value.
Update .rspec-base to add a new variable with the exit code determined in step 1. above, such as: SUCCESSFULLY_RETRIED_TEST_EXIT_CODE: 137, and set allow_failure: exit_codes to use this value:
```
variables:
  SUCCESSFULLY_RETRIED_TEST_EXIT_CODE: 137
allow_failure:
  exit_codes: 137
```
Update the retry_failed_rspec_examples function to exit with the error code defined by SUCCESSFULLY_RETRIED_TEST_EXIT_CODE if the following conditions are met:
1. The test passes after being retried.
2. The test that originally failed before being retried was part of the MR that triggered the pipeline.
Demonstrate that the new changes work as expected:
- If a test fails and is retried, but is not part of the MR that triggered the pipeline, the pipeline should pass without warnings.
- If a test fails and is retried, and is part of the MR that triggered the pipeline, the pipeline should pass with warnings.
Tested here.
Make maintainers aware of this change so they don't approve MRs whose pipeline has passed with warnings without proper justification.

/cc @rymai

Edited Dec 08, 2022 by Adam Cohen