Remove docker-machine provision on creation failure
What does this MR do?
Remove docker-machine provision on creation failure
Why was this MR needed?
In
#27613 (closed) (comment 522938477)
we see machines failing to create on the cloud provider, but it fails
for rate limit reasons or anything else. This leads to a split-brain issue where we have state of a machine "created" in docker-machine
state but the machine is nonexistent/inaccessible to the cloud provider.
If the creation fails docker-machine provision
runs on a machine that
doesn't exist which wastes API requests, and time trying to wait for SSH
to be available on a machine. The docker-machine provision
tries to
re-provision an existing machine, but if creation fails that means that
the machine doesn't exist.
Instead of retrying to create/provision the machine with the same details
fail fast, remove the machine and try to create a new one with details.
This should result in much quicker recovery when we reach the API rate
limits on the cloud provider and also job queues because the machine
would be available sooner rather than waiting for docker-machine provision
to be available.
What's the best way to test this MR?
There is no easy way to force a creation failure mid creation, and it's not always reproducible because it needs to be at the moment where the local state exists and remote state doesn't. Below is an attempt at this.
-
Start
docker+machine
runner configured like below to provision VMs on GCP.config.toml
[[runners]] name = "docker-machine" url = "http://gdk.test:3000/" token = "" executor = "docker+machine" clone_url = "http://192.168.190.160:3000" [runners.docker] tls_verify = false image = "alpine:3.12" privileged = false disable_entrypoint_overwrite = false oom_kill_disable = false disable_cache = false volumes = ["/cache"] shm_size = 0 [runners.machine] IdleCount = 0 MachineDriver = "google" MachineName = "%s-steveazz" MachineOptions = ["google-project=group-verify-df9383", "google-disk-size=50", "google-disk-type=pd-ssd", "google-machine-type=n1-standard-1", "engine-registry-mirror=https://mirror.gcr.io"]
-
Open up the GCP console with the configured project opened.
-
Start a job, and as soon as you see
Waiting for Instance
in the runner logs, delete the machine in GCPChecking for jobs... received job=2202 repo_url=http://gdk.test:3000/root/playground.git runner=RP-KCb2f Running pre-create checks... driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create (runner-rp-kcb2f-1615817278-568560df-steveazz) Check that the project exists driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create (runner-rp-kcb2f-1615817278-568560df-steveazz) Check if the instance already exists driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create Creating machine... driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create (runner-rp-kcb2f-1615817278-568560df-steveazz) Generating SSH Key driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create (runner-rp-kcb2f-1615817278-568560df-steveazz) Creating host... driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create (runner-rp-kcb2f-1615817278-568560df-steveazz) Opening firewall ports driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create (runner-rp-kcb2f-1615817278-568560df-steveazz) Creating instance driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create (runner-rp-kcb2f-1615817278-568560df-steveazz) Waiting for Instance driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create #<-------- Delete when you see this log message
-
You should see the following logs, where it tries to re-provision the machine and waiting for SSH to be available (this is a dead-end because the machine doesn't exist!)
ERROR: Error creating machine: Error in driver during machine creation: operation error: {RESOURCE_NOT_FOUND The resource 'projects/group-verify-df9383/zones/us-central1-a/instances/runner-rp-kcb2f-1615817278-568560df-steveazz' was not found [] []} driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create WARNING: Machine creation failed, trying to provision error=exit status 1 name=runner-rp-kcb2f-1615817278-568560df-steveazz Waiting for SSH to be available... name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=provision
-
Terminate the runner
-
Delete local reference of the machine:
rm -rf ~/.docker/machine/machines/*
(make sure you don't have any other machine that you need) -
Start the process again but this time specify the
FF_SKIP_DOCKER_MACHINE_PROVISION_ON_CREATION_FAILURE
feature flagconfig.toml
[[runners]] name = "docker-machine-gpu" url = "https://gitlab.com/" token = "" executor = "docker+machine" [runners.feature_flags] "FF_SKIP_DOCKER_MACHINE_PROVISION_ON_CREATION_FAILURE" = true #<----- Enable feature flag [runners.docker] image = "ubuntu" privileged = false volumes = ["/cache"] [runners.machine] IdleCount = 0 MachineDriver = "google" MachineName = "steveazz-%s" MachineOptions = [ "google-project=group-verify-df9383", "google-disk-size=50", "google-disk-type=pd-ssd", "google-machine-type=n1-standard-1", "engine-registry-mirror=https://mirror.gcr.io", "engine-opt=mtu=1460", "engine-opt=ipv6", "engine-opt=fixed-cidr-v6=fc00::/7", ]
runner system log analysis
# Got job Checking for jobs... received job=1101059904 repo_url=https://gitlab.com/steveazz/playground.git runner=g_dsjkxZ # Creating first machine `runner-gdsjkxz-steveazz-1615878835-1abd984b` Running pre-create checks... driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create (runner-gdsjkxz-steveazz-1615878835-1abd984b) Check that the project exists driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create (runner-gdsjkxz-steveazz-1615878835-1abd984b) Check if the instance already exists driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create Creating machine... driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create (runner-gdsjkxz-steveazz-1615878835-1abd984b) Generating SSH Key driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create (runner-gdsjkxz-steveazz-1615878835-1abd984b) Creating host... driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create (runner-gdsjkxz-steveazz-1615878835-1abd984b) Opening firewall ports driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create (runner-gdsjkxz-steveazz-1615878835-1abd984b) Creating instance driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create (runner-gdsjkxz-steveazz-1615878835-1abd984b) Waiting for Instance driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create # Machine deleted by hand in GCP console right now ERROR: Error creating machine: Error in driver during machine creation: operation error: {RESOURCE_NOT_FOUND The resource 'projects/group-verify-df9383/zones/us-central1-a/instances/runner-gdsjkxz-steveazz-1615878835-1abd984b' was not found [] []} driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create # We can see the log from https://gitlab.com/gitlab-org/gitlab-runner/-/blob/4d60bcaab71374c026252605bfc5a9ecbd993294/executors/docker/machine/provider.go#L119 Skipping provision retry on failed machine error=exit status 1 name=runner-gdsjkxz-steveazz-1615878835-1abd984b ERROR: Machine creation failed error=exit status 1 name=runner-gdsjkxz-steveazz-1615878835-1abd984b time=3m15.250528188s WARNING: Requesting machine removal lifetime=3m15.250572207s name=runner-gdsjkxz-steveazz-1615878835-1abd984b now=2021-03-16 08:17:10.556802 +0100 CET m=+261.255227168 reason=Failed to create used=3m15.250572444s usedCount=0 WARNING: Stopping machine lifetime=3m24.401840107s name=runner-gdsjkxz-steveazz-1615878835-1abd984b reason=Failed to create used=47.110277ms usedCount=0 Stopping "runner-gdsjkxz-steveazz-1615878835-1abd984b"... name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=stop Running pre-create checks... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Check that the project exists driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create ERROR: googleapi: Error 404: The resource 'projects/group-verify-df9383/zones/us-central1-a/instances/runner-gdsjkxz-steveazz-1615878835-1abd984b' was not found, notFound name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=stop WARNING: Error while stopping machine error=exit status 1 lifetime=3m26.05819186s name=runner-gdsjkxz-steveazz-1615878835-1abd984b reason=Failed to create used=1.703461871s usedCount=0 WARNING: Removing machine lifetime=3m26.058247639s name=runner-gdsjkxz-steveazz-1615878835-1abd984b reason=Failed to create used=1.70351632s usedCount=0 About to remove runner-gdsjkxz-steveazz-1615878835-1abd984b name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=remove WARNING: This action will delete both local reference and remote instance. name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=remove (runner-gdsjkxz-steveazz-1615878835-1abd984b) Deleting instance. name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=remove (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Check if the instance already exists driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create # New machine creating instead of running `docker-machine provision` # that would have never succesdded Creating machine... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Generating SSH Key driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Creating host... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Opening firewall ports driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Creating instance driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create (runner-gdsjkxz-steveazz-1615878835-1abd984b) Remote instance does not exist, proceeding with removing local reference name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=remove Successfully removed runner-gdsjkxz-steveazz-1615878835-1abd984b name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=remove Machine removed lifetime=3m27.370799498s name=runner-gdsjkxz-steveazz-1615878835-1abd984b now=2021-03-16 08:17:22.676749 +0100 CET m=+273.375472400 reason=Failed to create retries=0 used=3.016069639s usedCount=0 (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Waiting for Instance driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Uploading SSH Key driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create Waiting for machine to be running, this may take a few minutes... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create Detecting operating system of created instance... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create Waiting for SSH to be available... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create Detecting the provisioner... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create Provisioning with ubuntu(systemd)... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create Installing Docker... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create Copying certs to the local machine directory... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create Copying certs to the remote machine... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create Setting Docker configuration on the remote daemon... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create Checking connection to Docker... driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create Docker is up and running! driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create To see how to connect your Docker Client to the Docker Engine running on this virtual machine, run: docker-machine env runner-gdsjkxz-steveazz-1615879040-fb8f7eeb driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create Skipping provision retry on failed machine error=<nil> name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb Machine created duration=2m16.766447559s name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb now=2021-03-16 08:19:37.427245 +0100 CET m=+408.129277143 retries=0 Created docker-machine created=2021-03-16 08:17:20.664122 +0100 CET m=+271.362795889 docker=tcp://34.70.78.2:2376 job=1101059904 name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb now=2021-03-16 08:19:38.198272 +0100 CET m=+408.900322644 project=10644153 runner=g_dsjkxZ usedcount=1 Starting docker-machine build... created=2021-03-16 08:17:20.664122 +0100 CET m=+271.362795889 docker=tcp://34.70.78.2:2376 job=1101059904 name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb now=2021-03-16 08:20:14.712678 +0100 CET m=+445.415625724 project=10644153 runner=g_dsjkxZ usedcount=1 Finished docker-machine build: <nil> created=2021-03-16 08:17:20.664122 +0100 CET m=+271.362795889 docker=tcp://34.70.78.2:2376 job=1101059904 name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb now=2021-03-16 08:20:29.597103 +0100 CET m=+460.300416482 project=10644153 runner=g_dsjkxZ usedcount=1 Cleaned up docker-machine created=2021-03-16 08:17:20.664122 +0100 CET m=+271.362795889 docker=tcp://34.70.78.2:2376 job=1101059904 name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb now=2021-03-16 08:20:30.086969 +0100 CET m=+460.790294736 project=10644153 runner=g_dsjkxZ usedcount=1 Job succeeded duration=6m34.78749474s job=1101059904 project=10644153 runner=g_dsjkxZ
What are the relevant issue numbers?
closes #27613 (closed)