Fix readinessProbe for starting the modsecurity ingress sidecar
Summary
There's a minor bug in our rollout of the modsecurity logging sidecar for the ingress managed app. The sidecar container restarts several times on startup, likely because it is failing its readiness checks. We should improve our readinessProbe so that a proper startup delay is given and no such restarts occur.
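As a rough illustration of that direction, here is a minimal sketch of one possible shape for the sidecar's container spec with a readinessProbe and an initial delay. The container name, image, args, and volume mount are taken from the pod description under "Relevant logs" below; the probe itself (the exec command and the delay/period/threshold values) is an assumption about how the fix might look, not the actual chart change:

# Hypothetical sidecar spec with a readinessProbe and startup delay
- name: modsecurity-log
  image: busybox
  args: ["/bin/sh", "-c", "tail -f /var/log/modsec/audit.log"]
  volumeMounts:
    - name: modsecurity-log-volume
      mountPath: /var/log/modsec
      readOnly: true
  readinessProbe:
    exec:
      # Only report ready once the audit log actually exists
      command: ["test", "-f", "/var/log/modsec/audit.log"]
    initialDelaySeconds: 10   # assumed value; allows time for the audit log to appear
    periodSeconds: 10
    failureThreshold: 3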
Steps to reproduce
Deploy the ingress GitLab-managed app
Example Project
https://gitlab.com/gitlab-org/defend/waf-enablement-demo/
What is the current bug behavior?
Installation of ingress results in several restarts of the logging sidecar container while it spins up
What is the expected correct behavior?
Installation of ingress should occur smoothly and result in zero container terminations
Relevant logs and/or screenshots
❯ kubectl get pods -n gitlab-managed-apps ingress-nginx-ingress-controller-c64566b4d-dc22v   master
NAME                                               READY   STATUS    RESTARTS   AGE
ingress-nginx-ingress-controller-c64566b4d-dc22v   2/2     Running   2          122m

❯ kubectl describe pod -n gitlab-managed-apps ingress-nginx-ingress-controller-c64566b4d-dc22v   master
Name:               ingress-nginx-ingress-controller-c64566b4d-dc22v
Namespace:          gitlab-managed-apps
Priority:           0
PriorityClassName:  <none>
Node:               gke-waf-enablement-demo-default-pool-ae892c4d-f73b/10.128.0.54
Start Time:         Fri, 22 Nov 2019 14:28:07 -0800
Labels:             app=nginx-ingress
                    component=controller
                    pod-template-hash=c64566b4d
                    release=ingress
Annotations:        prometheus.io/port: 10254
                    prometheus.io/scrape: true
Status:             Running
IP:                 10.41.1.7
Controlled By:      ReplicaSet/ingress-nginx-ingress-controller-c64566b4d
Containers:
  nginx-ingress-controller:
    Container ID:  docker://9393f9ba413d6ca1621846595196b0cca8359928473fffb9d58122b16a7ea4e7
    Image:         quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.25.1
    Image ID:      docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:0c4941fa8c812dd44297b5f4900e3b26c3e6a8a42940e48fe9a1a585fe8f7e25
    Ports:         80/TCP, 443/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --default-backend-service=gitlab-managed-apps/ingress-nginx-ingress-default-backend
      --election-id=ingress-controller-leader
      --ingress-class=nginx
      --configmap=gitlab-managed-apps/ingress-nginx-ingress-controller
    State:          Running
      Started:      Fri, 22 Nov 2019 14:28:23 -0800
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       ingress-nginx-ingress-controller-c64566b4d-dc22v (v1:metadata.name)
      POD_NAMESPACE:  gitlab-managed-apps (v1:metadata.namespace)
    Mounts:
      /etc/nginx/modsecurity/modsecurity.conf from modsecurity-template-volume (rw)
      /var/log/modsec from modsecurity-log-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ingress-nginx-ingress-token-tbvc5 (ro)
  modsecurity-log:
    Container ID:  docker://d060252038bc5c56505a67e71171a81a9d0c7636788e4fead009f626889231bb
    Image:         busybox
    Image ID:      docker-pullable://busybox@sha256:679b1c1058c1f2dc59a3ee70eed986a88811c0205c8ceea57cec5f22d2c3fbb1
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      tail -f /var/log/modsec/audit.log
    State:          Running
      Started:      Fri, 22 Nov 2019 14:28:40 -0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 22 Nov 2019 14:28:24 -0800
      Finished:     Fri, 22 Nov 2019 14:28:24 -0800
    Ready:          True
    Restart Count:  2
    Environment:    <none>
    Mounts:
      /var/log/modsec from modsecurity-log-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from ingress-nginx-ingress-token-tbvc5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  modsecurity-template-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ingress-nginx-ingress-controller
    Optional:  false
  modsecurity-log-volume:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  ingress-nginx-ingress-token-tbvc5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-nginx-ingress-token-tbvc5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
Output of checks
This bug happens on GitLab.com
Possible fixes
Adjust the readinessProbe for the modsecurity-log sidecar so that a proper startup delay is applied and the container is not restarted while the pod is starting up.
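Once a change along those lines is in place, something like the following could confirm the sidecar comes up with zero restarts (namespace and labels taken from the output above; the pod name placeholder is illustrative):

# RESTARTS should stay at 0 for both containers once the pod is Ready
kubectl get pods -n gitlab-managed-apps -l app=nginx-ingress,component=controller

# Restart Count for the modsecurity-log container should remain 0
kubectl describe pod -n gitlab-managed-apps <controller-pod-name> | grep "Restart Count"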