Use health probes for docker service startup
Description
For services, we currently:
- Start the service.
- Guess the port the service might be using (the first port from the image's metadata).
- Start a new container running the helper binary's `healthcheck` command, which eventually either fails or succeeds. If the process never succeeds, the context deadline eventually cancels this container.
The issues with this approach are:
- Guessing a port is sometimes wrong. A Dockerfile can have multiple ports defined, but the service doesn't have to use the first one, the last one, or any of them in practice.
- A new container performing the health check is created for each service.
- We only support a TCP connection attempt to determine if a service is healthy. A service that is listening is not necessarily ready.
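To illustrate the last point, here is a minimal Python sketch (not runner code) in which a stand-in service accepts TCP connections but answers every HTTP request with 503. The handler, the function names, and the rule that 2xx/3xx counts as success (borrowed from Kubernetes' HTTP probes) are assumptions made for this example.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotReadyHandler(BaseHTTPRequestHandler):
    """Stand-in service: it is listening, but reports 503 on every path."""

    def do_GET(self):
        self.send_response(503)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_get_probe(host, port, path, timeout=1.0):
    """Return True only if GET path answers with a 2xx or 3xx status."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return 200 <= conn.getresponse().status < 400
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


def demo():
    """Run both probes against the not-ready service; returns (tcp_ok, http_ok)."""
    server = HTTPServer(("127.0.0.1", 0), NotReadyHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return tcp_probe("127.0.0.1", port), http_get_probe("127.0.0.1", port, "/health")
    finally:
        server.shutdown()
        server.server_close()
```

Here `demo()` reports the TCP probe as passing and the HTTP probe as failing, which is the gap the current TCP-only check cannot see.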
Proposal
Perform health checks similar to Kubernetes' startup probes.
Support:
- TCP Probe
- HTTP Get Probe
- Exec Probe
The docker executor will always technically use the Exec Probe:
- For a config-defined `exec` probe, the Exec Probe will be configured based on the supplied requirements.
- For a config-defined `tcp` or `http_get` probe, the Exec Probe is configured to run the helper binary process within a container that has already been started (but in a loop to keep it alive).
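The mapping above could look roughly like the following Python sketch. The binary name `helper-binary` and every flag shown (`--tcp-port`, `--http-port`, `--http-path`, `--http-header`) are hypothetical placeholders; the proposal does not specify the helper's command line, only that it has a `healthcheck` command.

```python
# Hypothetical helper binary name; the flags below are also placeholders,
# not the real helper's CLI.
HELPER = "helper-binary"


def to_exec_command(probe):
    """Map one probe definition (a dict) to the command an Exec Probe would run."""
    if "exec" in probe:
        # Config-defined exec probe: run the supplied command as-is.
        return list(probe["exec"]["command"])
    if "tcp" in probe:
        # tcp probe: delegate to the helper's healthcheck command.
        return [HELPER, "healthcheck", "--tcp-port", str(probe["tcp"]["port"])]
    if "http_get" in probe:
        cfg = probe["http_get"]
        cmd = [HELPER, "healthcheck",
               "--http-port", str(cfg["port"]),
               "--http-path", cfg.get("path", "/")]
        for header in cfg.get("headers", []):
            cmd += ["--http-header", header]
        return cmd
    raise ValueError("probe must define one of: exec, tcp, http_get")
```

For example, `to_exec_command({"tcp": {"port": 8080}})` yields a helper invocation, while an `exec` probe passes its own command through unchanged.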
Below is an example covering the settings available to each probe. A service can define multiple probes; they are executed in order, and if any probe still fails after its configured retries and timeouts, the service is considered not to have started correctly.
job:
  services:
  - name: service1
    probes:
    - tcp:
        port: 8080
      exec:
        command: ["/bin/check"]
      retries: 10 # optional
      initial_delay: 5 # optional
      period: 10 # optional
      timeout: 10 # optional
    - http_get:
        path: /health
        port: 8080
        headers:
        - 'X-Custom-Header: custom'
    - exec:
        command: ["/bin/check/another/thing"]
  - name: service2
    probes:
    - http_get:
        path: /health
        port: 8080
        headers:
        - 'X-Custom-Header: custom'
      initial_delay: 5 # optional
      period: 10 # optional
      timeout: 10 # optional
  - name: service3
    probes:
    - exec:
        command: ["/bin/check"]
      retries: 10 # optional
      initial_delay: 5 # optional
      period: 10 # optional
      timeout: 10 # optional
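How the optional settings might interact can be sketched in Python as follows. This is an illustration under stated assumptions, not the runner's implementation: it assumes all durations are in seconds, that `retries` counts additional attempts after the first, and that the default values shown are arbitrary. The `Probe` class and both function names are made up for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Probe:
    # check receives the per-attempt timeout and returns True on success.
    check: Callable[[float], bool]
    retries: int = 3            # assumed: extra attempts after the first
    initial_delay: float = 0.0  # assumed unit: seconds
    period: float = 1.0         # wait between attempts
    timeout: float = 1.0        # per-attempt timeout handed to check


def run_probe(probe: Probe, sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run a single probe until it succeeds or its retries are used up."""
    sleep(probe.initial_delay)
    for attempt in range(probe.retries + 1):
        if probe.check(probe.timeout):
            return True
        if attempt < probe.retries:
            sleep(probe.period)  # no wait after the final failed attempt
    return False


def service_started(probes: Iterable[Probe],
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run probes in order; all() stops at the first probe that fails."""
    return all(run_probe(p, sleep) for p in probes)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a real implementation would also need to enforce the overall context deadline.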
This task can be broken into the following steps:
- !2238 (closed) - Implement all probes alongside existing behaviour. Only support `NetworkPerBuild` networking mode and support what we already support (TCP Probe only).
- Support more than TCP Probe
  - Update `gitlab-ci.yml` to support all probes.
  - Update runner to make use of the already implemented probes.
- Disable Docker's built in HEALTHCHECK support.
Links to related issues and merge requests / references
This came from a discussion in !1195 (comment 140928174)
This can also solve the following bugs:
Edited by Arran Walker