After several hours, Envoy returns 503s for all clusters; seems related to "complete reconfigure" (update: happens on Consul certificate rotation)
Describe the bug
- Ambassador pod starts returning 503s for every request (all clusters have unhealthy upstreams) after several hours of running with healthy upstreams
- Using the ConsulResolver, AES 1.13.10, and AMBASSADOR_AMBEX_NO_RATELIMIT (a sketch of the resolver config follows this list)
- Possibly related to a "complete reconfigure", based on log lines that correlate with the problematic times
- Doesn't seem related to high memory, as this can occur when the Ambassador pod is at roughly 40% memory usage
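For context, a minimal sketch of the kind of ConsulResolver we use (the resolver name, Consul address, and datacenter are placeholders, not values from our environment):

```sh
# Hedged sketch: apply a ConsulResolver pointing Ambassador at Consul for
# endpoint discovery; all names and addresses below are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: getambassador.io/v2
kind: ConsulResolver
metadata:
  name: consul-dc1
spec:
  address: consul-server.consul.svc.cluster.local:8500
  datacenter: dc1
EOF
```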
To Reproduce
- UPDATE: disregard the bullets below (the original steps) and instead see the steps in the 2nd comment
- ~~Deploy Ambassador 1.13.10 with AMBASSADOR_AMBEX_NO_RATELIMIT=true using the ConsulResolver with a Consul Connect mesh~~
- ~~See that everything is working fine; clusters are sending traffic to pods as expected, and any app pod terminations/additions are recognized smoothly by Ambassador~~
- ~~Wait several hours, perhaps even 2 days~~
- ~~See that requests to apps start coming back as "503 UH" for some hours, but then go back to working fine again~~
- ~~Note that the logs will show a "complete reconfigure" very close to the beginning and end of the unhealthy-upstreams period~~
Expected behavior
- Requests continue to be served to the pods successfully indefinitely, i.e. the upstreams do not go unhealthy
Versions (please complete the following information):
- Ambassador: 1.13.10
- Kubernetes environment: AWS EKS, v1.18
- Consul Connect: 1.9.3
Additional context
We use the ConsulResolver and recently upgraded from 1.13.0 to 1.13.10; we also enabled AMBASSADOR_AMBEX_NO_RATELIMIT because we want to suppress the reconfigure throttling. We tried this out in a test environment and the feature worked as we hoped: high memory on the Ambassador pod no longer produced the "throttling" message in the logs and no longer slowed down the recognition of changes to pods (which typically causes us a lot of 503s), so that part is all good. A sketch of how the flag can be enabled follows.
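For reference, a minimal sketch of enabling the flag, assuming the AES Deployment is named ambassador in the ambassador namespace (both names are assumptions, not stated above):

```sh
# Hedged sketch: enable the flag on an existing Deployment. The deployment and
# namespace names are assumptions; adjust them to your install.
kubectl -n ambassador set env deployment/ambassador AMBASSADOR_AMBEX_NO_RATELIMIT=true
```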
However, once we deployed this to the initial stages of our real environments, we started getting a lot of 503s after many hours of running. In one of those environments it was clear from metrics and logs that all the 503s were coming from 1 of the 3 Ambassador pods. When I hopped into the container and checked the cluster metrics, all of the clusters appeared to be empty, i.e. there were no IP addresses in the metrics under /clusters. (Typically, when debugging, if I want to explicitly check the targets for a particular app's cluster, I run `curl -s http://127.0.0.1:8001/clusters | grep <app-name>`.)
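Building on that, a hedged sketch for spotting clusters that currently have no endpoints at all, assuming the Envoy admin interface is on 127.0.0.1:8001 as in the command above and that /clusters uses its usual `cluster::host::stat::value` line format (the /tmp paths are just scratch files):

```sh
# List clusters that report no endpoint (ip:port) lines at all in /clusters.
curl -s http://127.0.0.1:8001/clusters | awk -F'::' '{print $1}' | sort -u > /tmp/all_clusters
curl -s http://127.0.0.1:8001/clusters \
  | grep -E '::[0-9]+(\.[0-9]+){3}:[0-9]+::' \
  | awk -F'::' '{print $1}' | sort -u > /tmp/clusters_with_endpoints
# Names in the first list but not the second are clusters with zero endpoints.
comm -23 /tmp/all_clusters /tmp/clusters_with_endpoints
```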
What facts do we have so far?
- This problem has, thus far, always taken somewhere between several hours and multiple days to occur.
- We have never seen this happen with 1.13.0 or earlier.
- We have seen this problem occur in multiple environments, so there is repeatability (albeit slow to reproduce)
- This problem does not (at least) seem to be related to high memory; we have seen it happen at roughly 40% Ambassador pod memory usage
- We are not sure whether it is caused by the 1.13.0 to 1.13.10 upgrade alone or whether it also requires enabling AMBASSADOR_AMBEX_NO_RATELIMIT.
- To gather evidence, we have two environments running 1.13.10, one with the flag disabled and the other with it enabled, but the error is slow to reproduce, so we are still waiting on results.
- This problem fixes itself after a while, something like 3-7 hours
- Biggest Clue So Far: “complete reconfigure”
- Whenever this issue has occurred, we have seen logs indicating a "complete reconfigure" just as the problem starts and then another "complete reconfigure" right around the time the problem ends.
- However, there are other times when the "complete reconfigure" logs appear with no negative impact
- Here is an example of a set of such logs just before the first 503 (see 07:31:32) and just after the last 503 (see 13:23:27); a grep sketch for pulling these lines out of a pod's logs follows this list
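To locate the "complete reconfigure" lines around a 503 window, something like the following works (a hedged sketch; the namespace and pod name are placeholders):

```sh
# Hedged sketch: pull "complete reconfigure" lines (with their timestamps) from
# one Ambassador pod over the last day; namespace and pod name are placeholders.
kubectl -n ambassador logs <ambassador-pod> --since=24h | grep -i 'complete reconfigure'
```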
What Troubleshooting We Have Tried So Far
- Setting the log level to "debug", but so far we haven't found anything in those logs that gives us more clues
- Briefly setting the log level to "trace" during a problem window. We can't leave it on trace long enough to catch the onset of the error due to the volume of logging, but I did enable it on a problematic instance for a few minutes hoping I might see something going wrong with service discovery; I didn't manage to find anything helpful yet.
- Digging through the /ambassador/snapshot/*.yaml and *.json files from the impacted container.
- For instance, I went through each of the ambex-*.json files thinking I might see one where clusters.<app's-cluster>.targets was empty, but they did have valid targets in them (a jq sketch for automating this check follows the list).
- E.g.
{"ip": "10.131.142.176", "port": 20000, "target_kind": "Consul"}
Questions
- Is there anything relevant to know about the “complete reconfigure” log occurrences? Are they triggered by something? Are they indicative of something amiss?
- Does anyone have any other ideas on how to troubleshoot this issue or get more info?
Top GitHub Comments
We're experiencing this issue as well. It's a huge pain and a big limitation when using Ambassador alongside Consul.
Addressed in Version 2.4.0. We’d appreciate your feedback on whether this resolves the issue.