question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

After several hours, envoy returns all 503's for all clusters, seems related to "complete reconfigure" (update: happens on consul certificate rotation)

See original GitHub issue

Describe the bug

  • Ambassador pod starts returning all 503’s (all clusters have unhealthy upstreams) after several hours of running with healthy upstreams
  • Using ConsulResolver, AES 1.13.10,  AMBASSADOR_AMBEX_NO_RATELIMIT
  • possibly has to do with “complete reconfigure”, based on seeing logs that correlate with the problematic times
  • doesn’t seem related to high memory as this can occur when Ambassador pod is in the 40’s % memory usage

To Reproduce

  • UPDATE: disregard the below bullets (the original given steps) and instead see the steps in the 2nd comment
  • ~Deploy Ambassador 1.13.10 with AMBASSADOR_AMBEX_NO_RATELIMIT=true using the ConsulResolver with a Consul Connect Mesh~
  • ~See that everything is working fine; clusters are sending traffic to pods as expected, any app pod terminations / additions are recognized smoothly by Ambassador~
  • ~Wait several hours, perhaps even 2 days~
  • ~See that requests to apps start coming back as “503 UH” for some hours, but then go back to working fine again~
  • ~Note that the logs will show a “complete reconfigure” very close to the beginning and end of the unhealthy upstreams period~

Expected behavior

  • The requests continue to serve to the pods successfully indefinitely - i.e. the upstreams to do not go unhealthy

Versions (please complete the following information):

  • Ambassador: 1.13.10
  • Kubernetes environment: AWS EKS
  • Version 1.18
  • Consul Connect: 1.9.3

Additional context We use the ConsulResolver and recently upgraded to 1.13.10 (from 1.13.0) as well as enabled AMBASSADOR_AMBEX_NO_RATELIMIT since we’d like to try suppressing the reconfigure throttling. We tried this out in a test environment, and the feature worked as we hoped in that high memory on the ambassador pod did not cause the “throttling” message in the logs and also did not slow down the recognition of changes to pods (which typically causes us a lot of 503’s), so that’s all good.

However, once we deployed this to the initial stages of our real environments, after many hours of running we started to get a lot of 503’s. In one of those environments, it was clear from metrics/logs that all the 503’s were coming from 1 of the 3 ambassador pods. When I hopped into the container and checked the clusters’s metrics, it appeared that all the clusters were empty - that is - there were no IP addresses in the metrics under /clusters. (Typically when debugging, if I want to explicitly check targets for a particular pod’s cluster, I run: curl -s http://127.0.0.1:8001/clusters | grep <app-name> )

What facts do we have so far?

  • This problem has, thus far, always taken somewhere between several hours and multiple days to occur.
  • We have never seen this happen with 1.13.0 or earlier.
  • We have seen this problem occur in multiple environments, so there is repeatability (albeit slow to reproduce)
  • This problem does not (at least) seem to be related to high memory; we have seen it happen in the 40’s % of ambassador pod memory usage
  • We are not sure whether it has to do with the 1.13.0 to 1.13.10 upgrade alone or if it also requires the enabling of AMBASSADOR_AMBEX_NO_RATELIMIT.
    • To try and gain some evidence, we have two environments running 1.13.10, one with it disabled and the other with it enabled, but it is slow to reproduce the error, so we are waiting to find out something.
  • This problem fixes itself after a while - something like 3-7 hours
  • Biggest Clue So Far: “complete reconfigure”
    • whenever this issue has occurred, we have seen logs indicating a “complete reconfigure” just as the problem starts and then a “complete reconfigure” again right about the time the problem ends.
    • However, there are other times the “complete reconfigure” logs happen where there is no negative impact
    • here is an example of a set of such logs just before the first 503 (see 07:31:32) and just after the last 503 (see 13:23:27) complete_reconfigure_logs_1

What Troubleshooting We Have Tried So Far

  • Setting log level to “debug”, but haven’t found anything else that gives us any more clues in those logs so far
  • Briefly setting log level to “trace” during a problem window. We can’t leave it on trace for long enough to initially catch the error due to the volume of logging, but I did enable it on a problematic instances for a few minutes hoping I might see something going wrong with service discovery, but didn’t manage to find anything helpful yet.
  • digging through the /ambassador/snapshot/*.yaml and *.json files from the impacted container.
    • For instance, I went through each of the ambex-*.json files thinking I might see one where the clusters.<app’s-cluster>.targets was empty, but they did have valid targets in them.
    • E.g. {"ip": "10.131.142.176", "port": 20000, "target_kind": "Consul"}

Questions

  • Is there anything relevant to know about the “complete reconfigure” log occurrences? Are they triggered by something? Are they indicative of something amiss?
  • Does anyone have any other ideas on how to troubleshoot this issue or get more info?

Issue Analytics

  • State:open
  • Created 2 years ago
  • Reactions:1
  • Comments:6 (2 by maintainers)

github_iconTop GitHub Comments

2reactions
woz5999commented, Mar 7, 2022

We’re experiencing this issue as well. It’s a huge pain and giant limitation using ambassador alongside consul

0reactions
cindymullins-dwcommented, Sep 22, 2022

Addressed in Version 2.4.0. We’d appreciate your feedback on whether this resolves the issue.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Envoy Proxy breaks when enabling Consul TLS #7926 - GitHub
After enable TLS on consul, everything appears to work, however services that talk to the envoy proxy fail to pass data.
Read more >
Rotating Consul TLS Certificates - Hashicorp Support
All certificates have an expiry date upon which need to be updated with the valid certificate. However, the process for rotating the ...
Read more >
hashicorp-consul/Lobby - Gitter
I'm having a TCP keep-alive issued idle connections get disconnected after an hour seems to be related to idle_timeout, and I want to...
Read more >
Getting into HashiCorp Consul, Part 8 - YouTube
Cole Morrison and Rosemary Wang (Developer Advocates at HashiCorp) learn Consul the hard way by setting it from scratch.
Read more >
Configure Consul as HTTPS - TechDocs - Broadcom Inc.
Configure Consul as HTTPS. Last Updated October 18, 2022. You manage · Go to the. /daproxy/conf · Generate the Consul agent certificate files...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found