review-apps-ee cluster to support more frequent deployment
Hypothesis: when deployments happen too frequently, they start to fail.
Some data supporting this hypothesis:
- Deployment success rate is very high during the weekend, when there are few deployments.
- Review app deployments were changed to use DAG some time in early September, which increases the likelihood of the deploy job being executed. The success rate dropped on the same day.
  - gitlab: 3b9238b7
  - gitlab-foss: gitlab-foss!32366 (diffs)
- We have been tweaking pod-level parameters and were able to get some improvement, but there is still a lot of room for improvement. I think that to get any further significant improvement in the success rate, we should switch gears and improve the cluster.
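One way to check the hypothesis against job data is to compare success rates between quiet and busy hours. The following is only a rough sketch: the `(started_at, succeeded)` record shape, the function name `success_rate_by_load`, and the `busy_threshold` of 10 deployments per hour are all assumptions made for illustration, not something taken from our pipelines.

```python
from collections import defaultdict
from datetime import datetime


def success_rate_by_load(deployments, busy_threshold=10):
    """Compare success rates of quiet hours vs busy hours.

    `deployments` is an iterable of (started_at: datetime, succeeded: bool).
    An hour counts as "busy" when it has at least `busy_threshold`
    deployments (a hypothetical cut-off, tune it to the real data).
    """
    # Group outcomes by the clock hour in which the deployment started.
    per_hour = defaultdict(list)
    for started_at, succeeded in deployments:
        hour = started_at.replace(minute=0, second=0, microsecond=0)
        per_hour[hour].append(succeeded)

    # Accumulate [successes, attempts] for each load bucket.
    totals = {"quiet": [0, 0], "busy": [0, 0]}
    for outcomes in per_hour.values():
        bucket = "busy" if len(outcomes) >= busy_threshold else "quiet"
        totals[bucket][0] += sum(outcomes)
        totals[bucket][1] += len(outcomes)

    # None means there were no deployments in that bucket.
    return {
        bucket: (ok / n if n else None)
        for bucket, (ok, n) in totals.items()
    }
```

If the hypothesis holds, the `busy` rate should come out clearly lower than the `quiet` rate. A similar rate in both buckets would suggest the failures have another cause.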
Some initial thoughts I have:
- increase the timeout from 15 minutes to 20 or 30 minutes, to allow time for the cloud provider (GCP) to autoscale
- tune the cluster autoscaling rules to pre-empt deployments, i.e. start new nodes earlier instead of waiting for a deployment to request more nodes
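To make the second idea concrete, here is a back-of-the-envelope sketch of how many nodes we might want to start ahead of demand. Everything in it is hypothetical: the function name `nodes_to_prewarm`, its parameters, and the use of pod slots as the unit of capacity. Real scheduling is driven by CPU and memory requests, so treat this as a simplification rather than the actual autoscaler logic.

```python
import math


def nodes_to_prewarm(expected_deployments, pods_per_deployment,
                     pods_per_node, free_pod_slots):
    """Estimate how many extra nodes to start before deployments arrive.

    All inputs are hypothetical planning numbers:
    - expected_deployments: deployments we expect in the next window
    - pods_per_deployment: pods one review app needs
    - pods_per_node: pods that fit on one node
    - free_pod_slots: pod capacity currently unused in the cluster
    """
    needed = expected_deployments * pods_per_deployment
    # Only the demand that does not fit into existing capacity needs nodes.
    shortfall = max(0, needed - free_pod_slots)
    # Round up: a partially used node is still a whole node.
    return math.ceil(shortfall / pods_per_node)
```

For example, `nodes_to_prewarm(5, 8, 30, 10)` means 40 pods are needed with 10 slots free, so the 30-pod shortfall fits on one extra node.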
/cc @gl-quality/eng-prod Thoughts?
Edited by Albert Salim