check_alive 503 with ALB
Describe the bug
We are using an ALB alongside a NodePort Service to route traffic to our Ambassador pods. We are seeing a small but fairly consistent percentage (2-3%) of ALB health checks fail with a 503 UC. We have 5 Mappings, all of which have remained the same since we started adopting Ambassador.
The number of 503s is sporadic, but with ~25 nodes health checking down to 6 Ambassador pods via the NodePort, we see about 1-2% of our total traffic volume (including the health checks) register as 503s in Envoy. I can get more specific info or run more specific tests if need be. See below for the health check config.
To Reproduce
Use the ALB ingress controller to set up an Ingress pointing at your kind: Service of type: NodePort.
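For concreteness, a minimal sketch of that setup with the aws-alb-ingress-controller might look like the following. Resource names, ports, and label selectors here are illustrative assumptions, not the reporter's actual manifests.

apiVersion: extensions/v1beta1   # Ingress API available on Kubernetes 1.14
kind: Ingress
metadata:
  name: ambassador
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: instance   # route to the NodePort on each worker
    alb.ingress.kubernetes.io/healthcheck-path: /ambassador/v0/check_alive
spec:
  backend:
    serviceName: ambassador
    servicePort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: ambassador
spec:
  type: NodePort
  selector:
    service: ambassador        # assumes the stock Ambassador pod labels
  ports:
    - name: http
      port: 80
      targetPort: 8080         # Ambassador's default HTTP listener port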
Expected behavior
The ALB health checks pass without 503s, as the pods are healthy and not crashing or restarting.
Here is the log line from Envoy
2019-12-09T20:54:05.000Z <POD_NAME> ACCESS [2019-12-09T20:54:04.729Z] "GET /ambassador/v0/check_alive HTTP/1.1" 503 UC 0 95 0 - "-" "ELB-HealthChecker/2.0" "e9b1934b-e62c-488b-a0a3-fb8f7935803b" "REDACTED:32149" "127.0.0.1:8877"
Versions (please complete the following information):
- Ambassador: 0.85.0
- Kubernetes: 1.14.8 (on EC2)
Additional context
We have the health check on the ALB configured with the following:
When sending traffic via siege to the ALB DNS name on the check_alive path, all requests succeed. Testing via a node in the cluster on its NodePort, all requests also succeed. Any thoughts or guidance on how to narrow this down?
Note: regular traffic works as expected, Mappings and all. This seems to affect only the AWS ALB health check.
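For context, the Mappings mentioned above are Ambassador's routing resources. A minimal sketch of one (the name, prefix, and upstream service are hypothetical, not one of the reporter's five Mappings):

apiVersion: getambassador.io/v1
kind: Mapping
metadata:
  name: example-mapping
spec:
  prefix: /example/            # requests under this prefix...
  service: example-service:80  # ...are routed to this upstream Service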
Top GitHub Comments
We are experiencing the same issue in all of our environments running Ambassador v0.86.1, Kubernetes v1.16.4, and Calico v3.11. We have ALBs pointed at cluster workers via a NodePort service, and the health checks are sporadically failing with a 503 error, showing up as a “UC” error code in Envoy’s access logs exactly as shown above. In this case, our ALB health check is routed to a Java application instance running Tomcat rather than Ambassador’s own health check.
There seems to be an issue, possibly with Envoy connection pooling as suggested, that is causing requests originating from within the VPC to sporadically fail with a 503. We are also seeing sporadic 503s when we run our Helm chart test suite, which spins up Job resources in the cluster that test external endpoints through the ALB. The failures behave like the well-documented issues with Envoy and HTTP keep-alive connections, but no amount of adjusting the keep-alive settings in all parts of the stack has fixed this issue. Even setting up retries doesn’t completely eliminate it. Envoy reports that the upstream returns a connection refused (code 111) error:
This seems to only happen for traffic originating from within the VPC. There are no errors when hitting the services directly via their ClusterIPs and there are no errors for requests originating outside the VPC like customer API calls and external health monitoring that hit the same endpoints.
My next steps are to see if the EA release of Ambassador Edge Stack fixes it. If that doesn’t work, I’ll start analyzing packet captures between the pods.
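For reference, the per-Mapping retries the commenter alludes to can be expressed roughly as below. This is a hedged sketch assuming Ambassador's retry_policy field on a Mapping; the names and values are illustrative, not the commenter's actual configuration.

apiVersion: getambassador.io/v1
kind: Mapping
metadata:
  name: example-mapping
spec:
  prefix: /example/
  service: example-service:80
  retry_policy:
    retry_on: connect-failure   # retry when the upstream refuses the connection
    num_retries: 2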
I’ve done some additional digging into this with @joshbranham. It seems increasing the gunicorn keepalive setting significantly reduces the frequency of these 503s (although we still see one occasionally). I’m guessing there is some interaction between gunicorn closing connections and Envoy’s connection pool that causes these 503s. Thoughts? I’d be happy to open a PR if this is something other people are running into.