check_alive 503 with ALB
Describe the bug
We are using an ALB alongside a NodePort Service to route traffic to our Ambassador pods. We are seeing a small but fairly consistent percentage (2-3%) of ALB health checks fail with a 503 UC. We have 5 Mappings, all of which have remained the same since we started adopting Ambassador.
The number of 503s is sporadic, but with ~25 nodes health checking down to 6 Ambassador pods via the NodePort, we see about 1-2% of our total traffic volume (including the health checks) register as 503s in Envoy. I can get more specific info or run more specific tests if need be. See below for the health check config.
To Reproduce
Use the ALB ingress controller to set up an Ingress pointing at your kind: Service of type: NodePort.
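For concreteness, a minimal sketch of that setup with the aws-alb-ingress-controller might look like the following. Resource names, ports, and label selectors here are illustrative assumptions, not the reporter's actual manifests.

apiVersion: extensions/v1beta1   # Ingress API available on Kubernetes 1.14
kind: Ingress
metadata:
  name: ambassador
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: instance   # route to the NodePort on each worker
    alb.ingress.kubernetes.io/healthcheck-path: /ambassador/v0/check_alive
spec:
  backend:
    serviceName: ambassador
    servicePort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: ambassador
spec:
  type: NodePort
  selector:
    service: ambassador        # assumes the stock Ambassador pod labels
  ports:
    - name: http
      port: 80
      targetPort: 8080         # Ambassador's default HTTP listener port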
Expected behavior
The ALB health checks pass without 503s, as the pods are healthy and not crashing or restarting.
Here is the log line from Envoy
2019-12-09T20:54:05.000Z <POD_NAME> ACCESS [2019-12-09T20:54:04.729Z] "GET /ambassador/v0/check_alive HTTP/1.1" 503 UC 0 95 0 - "-" "ELB-HealthChecker/2.0" "e9b1934b-e62c-488b-a0a3-fb8f7935803b" "REDACTED:32149" "127.0.0.1:8877"
Versions (please complete the following information):
- Ambassador: 0.85.0
- Kubernetes: 1.14.8 (on EC2)
Additional context
We have the health check on the ALB configured with the following:
When sending traffic via siege to the ALB DNS name on the check_alive path, all requests succeed. Testing via a node in the cluster on its NodePort, all requests also succeed. Any thoughts or guidance on how to narrow this down?
Note: regular traffic works as expected, Mappings and all. This seems to affect only the AWS ALB health check.
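For context, the Mappings mentioned above are Ambassador's routing resources. A minimal sketch of one (the name, prefix, and upstream service are hypothetical, not one of the reporter's five Mappings):

apiVersion: getambassador.io/v1
kind: Mapping
metadata:
  name: example-mapping
spec:
  prefix: /example/            # requests under this prefix...
  service: example-service:80  # ...are routed to this upstream Service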
Top GitHub Comments
We are experiencing the same issue in all of our environments running Ambassador v0.86.1, Kubernetes v1.16.4, and Calico v3.11. We have ALBs pointed at cluster workers via a NodePort service, and the health checks are sporadically failing with a 503 error, showing up as a “UC” error code in Envoy’s access logs exactly as shown above. In this case, our ALB health check is routed to a Java application instance running Tomcat rather than Ambassador’s own health check.
There seems to be an issue, possibly with Envoy connection pooling as suggested, that is causing requests originating from within the VPC to sporadically fail with a 503. We are also seeing sporadic 503s when we run our Helm chart test suite, which spins up Job resources in the cluster that test external endpoints through the ALB. The failures behave like the well-documented issues with Envoy and HTTP keep-alive connections, but no amount of adjusting the keep-alive settings in all parts of the stack has fixed this issue. Even setting up retries doesn’t completely eliminate it. Envoy reports that the upstream returns a connection refused (code 111) error:
This seems to only happen for traffic originating from within the VPC. There are no errors when hitting the services directly via their ClusterIPs and there are no errors for requests originating outside the VPC like customer API calls and external health monitoring that hit the same endpoints.
My next steps are to see if the EA release of Ambassador Edge Stack fixes it. If that doesn’t work, I’ll start analyzing packet captures between the pods.
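For reference, the per-Mapping retries the commenter alludes to can be expressed roughly as below. This is a hedged sketch assuming Ambassador's retry_policy field on a Mapping; the names and values are illustrative, not the commenter's actual configuration.

apiVersion: getambassador.io/v1
kind: Mapping
metadata:
  name: example-mapping
spec:
  prefix: /example/
  service: example-service:80
  retry_policy:
    retry_on: connect-failure   # retry when the upstream refuses the connection
    num_retries: 2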
I’ve done some additional digging into this with @joshbranham. It seems increasing the gunicorn keepalive setting significantly reduces the frequency of these 503s (although we still see one occasionally). I’m guessing there is some interaction between gunicorn closing connections and Envoy’s connection pool that causes these 503s. Thoughts? I’d be happy to open a PR if this is something other people are running into.