Auto-retry jobs when they failed due to a known flaky test (!168541)

David Dieulivol requested to merge ddieulivol-test_retry_with_exit_status_codes into master Oct 08, 2024

From Draft to Ready

- The `found_known_flaky_tests` bash function is implemented
- The MR description is up to date, and I tested all the edge cases I want to test
- gitlab-org/ruby/gems/gitlab_quality-test_tooling!260 (merged) is merged and released, and this MR uses that newly released version

What does this MR do and why?

- Auto-retry a CI/CD job (i.e. set a custom exit code for the job) when we detect that it failed because of a known flaky test, meaning a test that already made a CI/CD job fail on the main branch (i.e. one tracked in an issue with the ~"test-health:failures" label).
- Write a comment to the reporting issue (gitlab-org/quality/engineering-productivity/team#573) whenever a CI/CD job is about to fail after two RSpec processes.
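
As a rough illustration of the behavior described above, the sketch below maps an RSpec exit code to the exit code the job finishes with. It is not the code from this MR: the function name `job_exit_code`, its arguments, and the value `112` are assumptions made for the example. Only the CI/CD variable name and the idea of a custom exit code come from this description. In the real job, the flaky-test check would come from something like `found_known_flaky_tests`, whose implementation is not shown here.

```shell
#!/usr/bin/env bash

# Assumed value for illustration only; the real custom exit code is not
# stated in this description.
FLAKY_RETRY_EXIT_CODE=112

# Hypothetical helper: prints the exit code the job should finish with.
#   $1: exit code of the RSpec run
#   $2: "true" if a failed test matches a known flaky test
job_exit_code() {
  local rspec_exit_code="$1"
  local known_flaky_found="$2"

  if [ "${rspec_exit_code}" -eq 0 ]; then
    # RSpec passed: nothing to do.
    echo 0
  elif [ "${CI_AUTO_RETRY_JOBS_WITH_FLAKY_TESTS_ENABLED:-false}" != "true" ]; then
    # Feature disabled: keep the original exit code.
    echo "${rspec_exit_code}"
  elif [ "${known_flaky_found}" = "true" ]; then
    # Failure attributed to a known flaky test: signal "please retry".
    echo "${FLAKY_RETRY_EXIT_CODE}"
  else
    # Genuine failure: keep the original exit code.
    echo "${rspec_exit_code}"
  fi
}
```

A job would then presumably list that exit code under `retry:exit_codes` in its CI/CD configuration so that GitLab retries it; that pairing is an assumption here, not something this description states.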

Proof of work

When `$CI_AUTO_RETRY_JOBS_WITH_FLAKY_TESTS_ENABLED = true`

When the CI/CD job failed due to a known flaky test

When `$CI_AUTO_RETRY_JOBS_WITH_FLAKY_TESTS_NOTIFICATIONS_ENABLED = true`

Using #499936 as a reference.

Test commit to make the test above fail on purpose.

Expected:
- The job should be auto-retried
- We should see a comment for that job in gitlab-org/quality/engineering-productivity/team#573.
Actual:
- The jobs were auto-retried (e.g. 1, 2, 3; there were more) 🎉
- There are comments in the reporting issue (1,2,3) 🎉
  - Note: It's also the case for FOSS jobs (e.g. see gitlab-org/quality/engineering-productivity/team#573 (comment 2170766395))
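
The scenario above suggests that comments are posted only when both variables are set to `true`. A minimal sketch of that gate, assuming this reading is right (the function name `should_notify_reporting_issue` is hypothetical and does not come from this MR):

```shell
#!/usr/bin/env bash

# Hypothetical helper: returns 0 (success) when a comment should be posted
# to the reporting issue, and 1 otherwise. Unset variables count as "false".
should_notify_reporting_issue() {
  [ "${CI_AUTO_RETRY_JOBS_WITH_FLAKY_TESTS_ENABLED:-false}" = "true" ] &&
    [ "${CI_AUTO_RETRY_JOBS_WITH_FLAKY_TESTS_NOTIFICATIONS_ENABLED:-false}" = "true" ]
}
```

With this shape, turning off `$CI_AUTO_RETRY_JOBS_WITH_FLAKY_TESTS_ENABLED` also silences notifications, which matches the disabled case where the logic is not run at all.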

When `$CI_AUTO_RETRY_JOBS_WITH_FLAKY_TESTS_ENABLED = false`

Expected:
- The job's exit code is not changed
- No comment is posted for that job in gitlab-org/quality/engineering-productivity/team#573.
Actual:
- The logic wasn't run: https://gitlab.com/gitlab-org/gitlab/-/jobs/8150778528#L1703 (so no detection/comments in issue)
- The exit code was 1: https://gitlab.com/gitlab-org/gitlab/-/jobs/8150778528#L1806

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited Oct 23, 2024 by David Dieulivol
