hpa downscaling webservice causing 502 errors with AWS Load Balancer
Problem to solve
Advice on configuring the AWS Load Balancer when it is being used as a replacement for nginx-ingress.
The current general recommendation is to keep a minimum of 75% of webservice pods available. The customer would like to see this lowered for improved cost savings.
Configuration investigation for nginx-ingress is being done in "hpa downscaling webservice causing 502 errors with nginx-ingress". This issue is to investigate how that advice translates to AWS Load Balancer usage instead.
This documentation could be added to https://docs.gitlab.com/charts/charts/gitlab/webservice/.
Further details
Proposal
Who can address the issue
Ingress configuration when using AWS Load Balancer
Other links/references
hpa downscaling webservice causing 502 errors with nginx-ingress
Customer request via ZD - internal to GitLab team members
Root cause and resolution
When Kubernetes terminates a Webservice Pod, the AWS Load Balancer appears to continue to briefly send new traffic to the Pod, even when the endpoint has been marked as "Draining" in the target group by the AWS Load Balancer Controller.
AWS recommends using a `preStop` hook to catch the `SIGTERM` from Kubernetes and have the application return an "unhealthy" status and sleep long enough for the AWS Load Balancer to recognize that the endpoint is unhealthy and remove it from the target group.
While the webservice container makes use of a `preStop` hook to send a `SIGINT` to the puma master process, and sets an environment variable used by the Rails application to sleep for the period set by the `shutdown.blackoutSeconds` value for the Webservice Chart, the gitlab-workhorse container responds to the `SIGTERM` immediately and shuts down.
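The preStop-sleep pattern described above can be sketched as a container spec fragment. This is illustrative only: the container name and sleep duration are example values, not the chart's actual rendered output.

```yaml
# Illustrative sketch of the preStop-sleep pattern AWS recommends.
# Values here are examples; the chart wires this up via
# shutdown.blackoutSeconds rather than a hard-coded sleep.
containers:
  - name: gitlab-workhorse
    lifecycle:
      preStop:
        exec:
          # Keep the container alive (while failing health checks)
          # long enough for the ALB to deregister the target before
          # the process shuts down.
          command: ["/bin/sh", "-c", "sleep 30"]
```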
We've opened !3084 (merged) to add a `preStop` sleep for the gitlab-workhorse container, using the same `shutdown.blackoutSeconds` value. When used with an appropriate value for `shutdown.blackoutSeconds` and the correct AWS Load Balancer Controller healthcheck annotations for the webservice Ingress, this should mitigate most, if not all, of the `502` errors returned by the AWS Load Balancer when webservice pods terminate due to scale-down events.
We'd recommend the following settings and annotations once !3084 (merged) is released:
- `shutdown.blackoutSeconds: 30` (or longer, to let long-running requests finish).
- `deployment.terminationGracePeriodSeconds: 40` (must be longer than `shutdown.blackoutSeconds`).

The default for `shutdown.blackoutSeconds` is `10s`; this is too low for the minimum healthcheck interval for the AWS Load Balancer (which is `5s` at minimum and `15s` by default).
Set the following for `gitlab.webservice.ingress.annotations` (Note: these should be set specifically for `webservice` and not as part of `global.ingress.annotations`):
- `alb.ingress.kubernetes.io/healthcheck-path: "/-/readiness"`
- `alb.ingress.kubernetes.io/healthcheck-interval-seconds: '10'` (or any value less than `shutdown.blackoutSeconds / 2`; the default to mark the endpoint unhealthy is 2 subsequent 'unhealthy' responses).
- `alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '5'` (must be less than `healthcheck-interval-seconds`).
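Taken together, the settings above can be sketched as a Helm values fragment. This is a sketch based on the key names used in this issue; verify the exact paths against your chart version before applying.

```yaml
# Sketch of the recommended values; verify key paths against your
# chart version before applying.
gitlab:
  webservice:
    shutdown:
      blackoutSeconds: 30  # or longer, to let long-running requests finish
    deployment:
      terminationGracePeriodSeconds: 40  # must exceed shutdown.blackoutSeconds
    ingress:
      annotations:
        alb.ingress.kubernetes.io/healthcheck-path: "/-/readiness"
        # Less than blackoutSeconds / 2: two consecutive failed checks
        # are needed to mark the target unhealthy.
        alb.ingress.kubernetes.io/healthcheck-interval-seconds: "10"
        # Must be less than healthcheck-interval-seconds.
        alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
```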
We recommend testing these values with your usage patterns and traffic, and tuning them as appropriate for your environment.