When GCP rate limits are reached runner keeps waiting for machine to be created until timeout [Docker Machine executor]
Release notes
With this feature, if creating a virtual machine fails, the GitLab Runner Docker Machine executor skips the docker-machine provisioning step, removes the failed VM, and creates a new one.
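The behavior described above can be sketched roughly as follows. This is an illustration only, not the GitLab Runner implementation: `MachineCreateError`, `acquire_machine`, the `runner-vm-N` naming, and the `max_attempts` limit are all hypothetical, and the `create`, `provision`, and `remove` callables stand in for the real docker-machine operations.

```python
class MachineCreateError(Exception):
    """Hypothetical error raised when the cloud provider rejects VM creation."""


def acquire_machine(create, provision, remove, max_attempts=3):
    """Return the name of a provisioned machine, retrying failed creations.

    create, provision, and remove are caller-supplied callables that take a
    machine name; they are placeholders for the real docker-machine steps.
    """
    for attempt in range(1, max_attempts + 1):
        name = f"runner-vm-{attempt}"  # hypothetical naming scheme
        try:
            create(name)
        except MachineCreateError:
            # Creation failed, so there is nothing to provision or SSH into.
            # Remove whatever was left behind and try a fresh machine.
            remove(name)
            continue
        # Only a machine that was actually created reaches provisioning.
        provision(name)
        return name
    raise RuntimeError(f"no machine created after {max_attempts} attempts")
```

The point of the sketch is that a creation failure leads straight to cleanup, so the provisioning step (and its wait for SSH) is never started for a machine that does not exist.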
Overview
In incidents like gitlab-com/gl-infra/production#3712 (closed), where a massive backlog of jobs is stalled and the runner starts picking up a large number of jobs at the same time, we end up reaching Google API rate limits and get errors like:
error in driver during machine creation: requesting firewall rule: googleapi: Error 403: Quota exceeded for quota group 'ReadGroup' and limit 'Read requests per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:xxx'., rateLimitExceeded
However, when we get that error message, it seems we still wait for SSH on the "created machine" (which wasn't actually created because of the limit). This prevents us from creating a new machine, and we end up with a bunch of fake machines waiting for SSH to become available.
Notice the log above:
- Tried to create a machine at Feb 25, 2021 @ 04:41:42.298
- We got rate limited at Feb 25, 2021 @ 04:41:42.316
- We ended up waiting for a machine to be removed until Feb 25, 2021 @ 05:01:31.383
That is about 20 minutes wasted waiting for a machine to come up when it didn't exist.
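As a quick check of the timeline, the gaps between the three timestamps can be computed directly. This is a small sketch using only the timestamps quoted in this issue; the variable names and the format string are illustrative, and the month abbreviation is parsed assuming an English locale.

```python
from datetime import datetime

# Format of the timestamps as quoted in this issue.
FMT = "%b %d, %Y @ %H:%M:%S.%f"

create_requested = datetime.strptime("Feb 25, 2021 @ 04:41:42.298", FMT)
rate_limited = datetime.strptime("Feb 25, 2021 @ 04:41:42.316", FMT)
stopped_waiting = datetime.strptime("Feb 25, 2021 @ 05:01:31.383", FMT)

# Time between the create request and the rate-limit error.
time_to_error = rate_limited - create_requested
# Time spent waiting after the error was already known.
wasted = stopped_waiting - rate_limited

print(time_to_error)  # 0:00:00.018000
print(wasted)         # 0:19:49.067000
```

The error arrived 18 ms after the create request, so nearly the entire 19 minutes 49 seconds was spent waiting on a machine that was already known to have failed.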