Flaky test: TestRunIntegrationTestsWithFeatureFlag in integration_k8s test suite
The integration_k8s test suite has recently been failing intermittently on the TestRunIntegrationTestsWithFeatureFlag test. An example can be found at https://gitlab.com/gitlab-org/gitlab-runner/-/jobs/1833454016#L850 with the specific failure at https://gitlab.com/gitlab-org/gitlab-runner/-/jobs/1833454016#L1410.
From what I've seen, the failure is the same each time:
--- FAIL: TestRunIntegrationTestsWithFeatureFlag/testKubernetesGarbageCollection_FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY_false/pod_deletion_during_prepare_stage_in_custom_namespace (0.00s)
kubernetes_integration_test.go:724:
Error Trace: kubernetes_integration_test.go:724
kubernetes_integration_test.go:767
Error: Received unexpected error:
The POST operation against Namespace could not be completed at this time, please try again.
Test: TestRunIntegrationTestsWithFeatureFlag/testKubernetesGarbageCollection_FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY_false/pod_deletion_during_prepare_stage_in_custom_namespace
From what I see, the timeout here is defined by a context and is set to 1 minute. So for some reason the namespace creation operation sometimes takes more than a minute in the test environment, which randomly fails the test, then the job, then the pipeline.
As a short-term workaround we will add an automatic retry to the integration_k8s test suite. But since that can add up to 30 minutes to the pipeline execution time, we should find out what is causing the random failures and fix the problem properly.