Ambassador sends unexpected 503 on check_alive endpoint at startup
Describe the bug
The Ambassador liveness probe returns “ambassador seems to have died” while the k8s watches are being initialized.
During initialization I ran a simple bash script against a kubectl port-forward to measure how long the outage lasts:
kubectl port-forward <pod> 8877 &   # background the port-forward so the loop below can run
sleep 2                             # give the tunnel a moment to come up
while true; do
  date
  curl -s 127.0.0.1:8877/ambassador/v0/check_alive
  curl -s 127.0.0.1:8877/ambassador/v0/check_ready
  sleep 1
done
Here is the post-mortem:
Thu 29 Aug 2019 22:59:19 EDT
ambassador liveness check OK (18 seconds)
ambassador waiting for config
Thu 29 Aug 2019 22:59:21 EDT
ambassador seems to have died (20 seconds)
ambassador waiting for config
Thu 29 Aug 2019 22:59:22 EDT
ambassador seems to have died (21 seconds)
ambassador waiting for config
[...]
Thu 29 Aug 2019 22:59:40 EDT
ambassador seems to have died (39 seconds)
ambassador not ready (Never updated)
Thu 29 Aug 2019 22:59:42 EDT
ambassador seems to have died (41 seconds)
ambassador not ready (Never updated)
Thu 29 Aug 2019 22:59:43 EDT
ambassador liveness check OK (42 seconds)
Ambassador is down for about 24 s, and that is a problem because in the official Helm chart the liveness probe’s initial delay is only 30 s. The outage starts roughly 20 s after startup and is still in progress when the probe kicks in, so Kubernetes kills the pod and we end up in an infinite crash loop.
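As a stopgap, the probes can be given more headroom directly on the Deployment. The sketch below is only an illustration: the Deployment name, namespace, container index, and values are assumptions about a typical install, not details from this report.

# Relax the liveness probe so the ~24 s startup outage is tolerated (values illustrative).
kubectl patch deployment ambassador -n ambassador --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 60},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 10}
]'

Note that a Helm-managed Deployment will have such a patch reverted on the next helm upgrade, so a chart-level override (see the comments at the bottom) is the more durable workaround.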
The liveness probe failures seem to coincide with the setup of the k8s watches, while Ambassador waits for them to start: it simply stops answering in the middle of the procedure. Here are some logs from 59:23:
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
Then it comes back up at 59:43, once all the watches are up to date:
[2019-08-30 02:59:39.559][240][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'ambassador-listener-8443'
2019-08-30 02:59:39 diagd 0.73.0 [P155TAmbassadorEventWatcher] WARNING: Scout: could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6a98d51630>, 'Connection to kubernaut.io timed out. (connect timeout=1)'))
2019-08-30 02:59:39 diagd 0.73.0 [P155TAmbassadorEventWatcher] INFO: Scout reports {"latest_version": "0.73.0", "exception": "could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6a98d51630>, 'Connection to kubernaut.io timed out. (connect timeout=1)'))", "cached": false, "timestamp": 1567133979.723536}
2019-08-30 02:59:39 diagd 0.73.0 [P155TAmbassadorEventWatcher] INFO: Scout notices: []
To Reproduce
Steps to reproduce the behavior:
- Create a service like this on your cluster:
apiVersion: v1
kind: Service
metadata:
  annotations:
    getambassador.io/config: |
      apiVersion: ambassador/v1
      kind: Mapping
      ambassador_id: production
      name: my_service_mapping_number_n
      prefix: "/"
      service: service_n.namespace_n.svc.cluster.local
      host: service_n.domain.com
      ---
      apiVersion: ambassador/v1
      kind: TLSContext
      name: my_service_context_number_n
      ambassador_id: production
      hosts:
      - service_n.domain.com
      secret: my-service-certificate.namespace_n
- Create a lot of them, at least 15 different services (one way to generate them is sketched below).
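The following is a hypothetical generator script; every name, the namespace, and the port are placeholders, and the TLS secrets referenced by the TLSContexts must already exist to match our setup.

# Creates 15 annotated Services in one namespace (all names are illustrative).
NS=ambassador-repro
kubectl create namespace "${NS}" 2>/dev/null || true
for i in $(seq 1 15); do
  kubectl apply -n "${NS}" -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: service-${i}
  annotations:
    getambassador.io/config: |
      apiVersion: ambassador/v1
      kind: Mapping
      ambassador_id: production
      name: mapping-${i}
      prefix: "/"
      host: service-${i}.domain.com
      service: service-${i}.${NS}.svc.cluster.local
      ---
      apiVersion: ambassador/v1
      kind: TLSContext
      name: context-${i}
      ambassador_id: production
      hosts:
      - service-${i}.domain.com
      secret: certificate-${i}.${NS}
spec:
  ports:
  - port: 80
EOF
done

Then restart the Ambassador pod and run the probe loop from the description above to observe the “seems to have died” window.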
Expected behavior
Ambassador should not claim to have died during initialization when that is obviously not the case; it should simply report that it is not ready yet.
This is painful because the Helm chart sets the liveness probe’s initial delay to 30 s.
Our Ambassador is a shared service with a lot of apps on it, so it takes longer than that to start, and the liveness probe catches the 503 errors described above.
Kubernetes then kills the pod and we end up in an infinite crash loop.
Versions
- Ambassador: 0.73
- Kubernetes environment: Google Kubernetes Engine
- Version: 1.12.7
Additional context
We have a lot of services attached to this Ambassador, and unfortunately that makes the bug hit us harder than it otherwise would. The fact that Ambassador has to load a huge number of certificates doesn’t help, but the security team in the company doesn’t want to hand out wildcards, and we can’t afford to run one Ambassador per team.
Top GitHub Comments
Probably also related to #1661. I know it works by increasing the waiting time, but this shouldn’t be something that has to be done manually (in my case I’m using the chart).
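For chart users, the same relaxation could presumably be expressed as values rather than a manual patch. A sketch, assuming the chart exposes the probe settings under a livenessProbe key; the release name, chart reference, and value keys are placeholders to verify against the chart’s values.yaml:

# Illustrative only; check the chart's values.yaml for the real keys.
helm upgrade ambassador stable/ambassador \
  --reuse-values \
  --set livenessProbe.initialDelaySeconds=60 \
  --set livenessProbe.failureThreshold=10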