
check_alive 503 with ALB

See original GitHub issue

Describe the bug
We are using an ALB alongside a NodePort Service to route traffic to our Ambassador pods. We are seeing a pretty consistent small percentage (2-3%) of ALB health checks fail with a 503 UC.

We have 5 Mappings, all of which have remained the same since we started adopting Ambassador.

The number of 503s is pretty sporadic, but with ~25 nodes health checking down to 6 Ambassador pods via NodePort, we see about 1-2% of our total traffic volume (including the health checks) register as 503s in Envoy. I can get more specific info or try more specific tests if need be. See below for the health check config.

To Reproduce
Use the ALB ingress controller to set up an Ingress that points to your kind: Service of type: NodePort.
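
For reference, a minimal sketch of that kind of setup. The resource names, ports, and annotation values below are illustrative placeholders, not the reporter's actual manifests:

apiVersion: v1
kind: Service
metadata:
  name: ambassador
spec:
  type: NodePort                     # ALB instance targets hit the node, which forwards to an Ambassador pod
  selector:
    service: ambassador
  ports:
    - name: http
      port: 80
      targetPort: 8080
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: ambassador
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: instance   # register NodePorts rather than pod IPs
spec:
  backend:
    serviceName: ambassador
    servicePort: 80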

Expected behavior
The ALB health checks pass without 503s, as the pods are healthy and not crashing/restarting.

Here is the log line from Envoy:

2019-12-09T20:54:05.000Z <POD_NAME> ACCESS [2019-12-09T20:54:04.729Z] "GET /ambassador/v0/check_alive HTTP/1.1" 503 UC 0 95 0 - "-" "ELB-HealthChecker/2.0" "e9b1934b-e62c-488b-a0a3-fb8f7935803b" "REDACTED:32149" "127.0.0.1:8877"

Versions:

  • Ambassador: 0.85.0
  • Kubernetes on EC2 version 1.14.8

Additional context
We have the health check on the ALB configured with the following: [screenshot: "Screen Shot 2019-12-09 at 3 36 19 PM" showing the ALB health check settings]
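
With the ALB ingress controller, settings like those in the screenshot are typically expressed as Ingress annotations. A hedged sketch; the interval, timeout, and threshold numbers here are placeholders, not the values from the screenshot:

metadata:
  annotations:
    alb.ingress.kubernetes.io/healthcheck-protocol: HTTP
    alb.ingress.kubernetes.io/healthcheck-path: /ambassador/v0/check_alive
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "10"
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
    alb.ingress.kubernetes.io/healthy-threshold-count: "2"
    alb.ingress.kubernetes.io/unhealthy-threshold-count: "2"
    alb.ingress.kubernetes.io/success-codes: "200"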

When sending traffic via siege to the ALB DNS name and then to the check_alive path, all requests succeed. Testing against a node in the cluster on its NodePort also succeeds for every request. Any thoughts or guidance on how to narrow this down?

Note: regular traffic works as expected, Mappings and all. This seems to be only affecting the AWS ALB health check.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 4
  • Comments: 5

Top GitHub Comments

1 reaction
Rathgore commented, Jan 7, 2020

We are experiencing the same issue in all of our environments running Ambassador v0.86.1, Kubernetes v1.16.4 and Calico v3.11. We have ALBs pointed at cluster workers via a NodePort service and the health checks are sporadically failing with a 503 error, showing up as a “UC” error code in Envoy’s access logs exactly as shown above. In this case, our ALB health check is routed to a Java application instance running Tomcat rather than Ambassador’s own health check.

There seems to be an issue, possibly with Envoy connection pooling as suggested, that is causing requests originating from within the VPC to sporadically fail with a 503. We are also seeing sporadic 503s when we run our Helm chart test suite, which spins up Job resources in the cluster that test external endpoints through the ALB. The failures behave like the well-documented issues with Envoy and HTTP keep-alive connections, but no amount of adjusting the keep-alive settings in all parts of the stack has fixed this issue. Even setting up retries doesn’t completely eliminate it.
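
For context on the retry attempt mentioned above: retries of this kind can be declared on an Ambassador Mapping via retry_policy. A minimal sketch, assuming the Mapping CRD and retry_policy support in these Ambassador versions; the name, prefix, and service are hypothetical, and as noted this only papers over the connection-reuse race rather than fixing it:

apiVersion: getambassador.io/v1
kind: Mapping
metadata:
  name: backend-mapping
spec:
  prefix: /api/
  service: backend-service
  retry_policy:
    retry_on: "connect-failure"   # retry when the upstream connection is refused or reset
    num_retries: 2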

Envoy reports that the upstream returns a connection refused (code 111) error:

[2020-01-06 22:21:55.881][147][debug][pool] [source/common/http/http1/conn_pool.cc:95] creating a new connection
[2020-01-06 22:21:55.881][147][debug][client] [source/common/http/codec_client.cc:31] [C1092] connecting
[2020-01-06 22:21:55.881][147][debug][connection] [source/common/network/connection_impl.cc:711] [C1092] connecting to 10.110.52.34:80
[2020-01-06 22:21:55.881][147][debug][connection] [source/common/network/connection_impl.cc:720] [C1092] connection in progress
[2020-01-06 22:21:55.881][147][debug][pool] [source/common/http/conn_pool_base.cc:20] queueing request due to no available connections
[2020-01-06 22:21:55.881][147][debug][connection] [source/common/network/connection_impl.cc:568] [C1092] delayed connection error: 111
[2020-01-06 22:21:55.881][147][debug][connection] [source/common/network/connection_impl.cc:193] [C1092] closing socket: 0
[2020-01-06 22:21:55.881][147][debug][client] [source/common/http/codec_client.cc:88] [C1092] disconnect. resetting 0 pending requests
[2020-01-06 22:21:55.881][147][debug][pool] [source/common/http/http1/conn_pool.cc:136] [C1092] client disconnected, failure reason: 
[2020-01-06 22:21:55.881][147][debug][pool] [source/common/http/http1/conn_pool.cc:167] [C1092] purge pending, failure reason: 
[2020-01-06 22:21:55.881][147][debug][router] [source/common/router/router.cc:911] [C1084][S12840280777660819839] upstream reset: reset reason connection failure
[2020-01-06 22:21:55.881][147][debug][http] [source/common/http/conn_manager_impl.cc:1354] [C1084][S12840280777660819839] Sending local reply with details upstream_reset_before_response_started{connection failure}

This seems to only happen for traffic originating from within the VPC. There are no errors when hitting the services directly via their ClusterIPs and there are no errors for requests originating outside the VPC like customer API calls and external health monitoring that hit the same endpoints.

My next steps are to see if the EA release of Ambassador Edge Stack fixes it. If that doesn’t work, I’ll start analyzing packet captures between the pods.

1 reaction
kphelps commented, Dec 10, 2019

I’ve done some additional digging into this with @joshbranham. It seems increasing the gunicorn keepalive setting significantly reduces the frequency of these 503s (although we still see one occasionally). I’m guessing there is some interaction between gunicorn closing connections and Envoy’s connection pool causing these 503s. Thoughts? I’d be happy to open a PR if this is something other people are running into.
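
For anyone trying the same mitigation: the idea is to keep the upstream's keep-alive window longer than whatever reuses the connection in front of it (Envoy's pool, and the ALB), so gunicorn doesn't close a socket that is about to receive another request. A rough sketch of how that might look in the backend Deployment's container spec; the image, WSGI module, and the 75-second value are illustrative, not taken from this thread:

    containers:
      - name: api
        image: registry.example.com/my-api:latest   # hypothetical image
        command:
          - gunicorn
          - myapp.wsgi:application                  # hypothetical WSGI module
          - --bind=0.0.0.0:8000
          - --keep-alive=75                         # gunicorn default is 2s; raise it above the proxy's idle timeout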

Read more comments on GitHub >

Top Results From Across the Web

Troubleshoot 503 errors from Application Load Balancer
I get an HTTP 503 (Service unavailable) error when using an Application Load Balancer (ALB). How can I resolve this error?

Troubleshooting HTTP 503 errors returned when using a ...
Elastic Load Balancing detects unhealthy instances and routes traffic only to healthy instances. This blog discusses the troubleshooting steps ...

check_alive 503 with ALB · Issue #2091 · emissary-ingress ...
Describe the bug We are using an ALB alongside a NodePort Service to route traffic to our Ambassador pods. We are seeing a...

AWS ALB-ECS 503 Service Unavailable - terraform
I was able to fix this. The issue was the containers were not starting up due to a misconfigured log group.

amazon web services - 503 ALB health check HAProxy
