Ambassador sends unexpected 503 on check_alive endpoint at startup
Describe the bug
The Ambassador liveness probe returns “ambassador seems to have died” while the k8s watches are being initialized.
During initialization I ran a simple bash script against a kubectl port-forward to measure how long the outage lasts:
kubectl port-forward <pod> 8877 &   # background the port-forward so the loop below can run
sleep 2                             # give the tunnel a moment to come up
while true; do
  date
  curl -s 127.0.0.1:8877/ambassador/v0/check_alive
  curl -s 127.0.0.1:8877/ambassador/v0/check_ready
  sleep 1
done
Here is the post-mortem:
Thu 29 Aug 2019 22:59:19 EDT
ambassador liveness check OK (18 seconds)
ambassador waiting for config
Thu 29 Aug 2019 22:59:21 EDT
ambassador seems to have died (20 seconds)
ambassador waiting for config
Thu 29 Aug 2019 22:59:22 EDT
ambassador seems to have died (21 seconds)
ambassador waiting for config
[...]
Thu 29 Aug 2019 22:59:40 EDT
ambassador seems to have died (39 seconds)
ambassador not ready (Never updated)
Thu 29 Aug 2019 22:59:42 EDT
ambassador seems to have died (41 seconds)
ambassador not ready (Never updated)
Thu 29 Aug 2019 22:59:43 EDT
ambassador liveness check OK (42 seconds)
Ambassador is down for about 24 s, and that is a problem because in the official Helm chart the liveness probe’s initial delay is only 30 s. The outage starts roughly 20 s after startup and is still in progress when the probe kicks in, so Kubernetes kills the pod and we end up in an infinite crash loop.
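As a stopgap, the probes can be given more headroom directly on the Deployment. The sketch below is only an illustration: the Deployment name, namespace, container index, and values are assumptions about a typical install, not details from this report.

# Relax the liveness probe so the ~24 s startup outage is tolerated (values illustrative).
kubectl patch deployment ambassador -n ambassador --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 60},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 10}
]'

Note that a Helm-managed Deployment will have such a patch reverted on the next helm upgrade, so a chart-level override (see the comments at the bottom) is the more durable workaround.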
The liveness probe failures seem to coincide with the setup of the k8s watches, while Ambassador waits for them to start: it simply stops answering in the middle of the procedure. Here are some logs from 59:23:
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: waiting for k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
2019/08/30 02:59:23 aggregator: initialized k8s watch: secret|some-namespace|metadata.name=some-secret-with-certificate
Then it comes back up at 59:43, once all the watches are up to date:
[2019-08-30 02:59:39.559][240][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'ambassador-listener-8443'
2019-08-30 02:59:39 diagd 0.73.0 [P155TAmbassadorEventWatcher] WARNING: Scout: could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6a98d51630>, 'Connection to kubernaut.io timed out. (connect timeout=1)'))
2019-08-30 02:59:39 diagd 0.73.0 [P155TAmbassadorEventWatcher] INFO: Scout reports {"latest_version": "0.73.0", "exception": "could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6a98d51630>, 'Connection to kubernaut.io timed out. (connect timeout=1)'))", "cached": false, "timestamp": 1567133979.723536}
2019-08-30 02:59:39 diagd 0.73.0 [P155TAmbassadorEventWatcher] INFO: Scout notices: []
To Reproduce
Steps to reproduce the behavior:
- Create a service like this on your cluster:
apiVersion: v1
kind: Service
metadata:
  annotations:
    getambassador.io/config: |
      apiVersion: ambassador/v1
      kind: Mapping
      ambassador_id: production
      name: my_service_mapping_number_n
      prefix: "/"
      service: service_n.namespace_n.svc.cluster.local
      host: service_n.domain.com
      ---
      apiVersion: ambassador/v1
      kind: TLSContext
      name: my_service_context_number_n
      ambassador_id: production
      hosts:
      - service_n.domain.com
      secret: my-service-certificate.namespace_n
- Create a lot of them, at least 15 different services (one way to generate them is sketched below).
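The following is a hypothetical generator script; every name, the namespace, and the port are placeholders, and the TLS secrets referenced by the TLSContexts must already exist to match our setup.

# Creates 15 annotated Services in one namespace (all names are illustrative).
NS=ambassador-repro
kubectl create namespace "${NS}" 2>/dev/null || true
for i in $(seq 1 15); do
  kubectl apply -n "${NS}" -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: service-${i}
  annotations:
    getambassador.io/config: |
      apiVersion: ambassador/v1
      kind: Mapping
      ambassador_id: production
      name: mapping-${i}
      prefix: "/"
      host: service-${i}.domain.com
      service: service-${i}.${NS}.svc.cluster.local
      ---
      apiVersion: ambassador/v1
      kind: TLSContext
      name: context-${i}
      ambassador_id: production
      hosts:
      - service-${i}.domain.com
      secret: certificate-${i}.${NS}
spec:
  ports:
  - port: 80
EOF
done

Then restart the Ambassador pod and run the probe loop from the description above to observe the “seems to have died” window.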
Expected behavior
Ambassador should not claim to have died during initialization when that is obviously not the case; it should simply report that it is not ready yet.
This is painful because the Helm chart sets the liveness probe’s initial delay to 30 s.
Our Ambassador is a shared service with a lot of apps on it, so it takes longer than that to start, and the liveness probe catches the 503 errors described above.
Kubernetes then kills the pod and we end up in an infinite crash loop.
Versions
- Ambassador: 0.73
- Kubernetes environment: Google Kubernetes Engine
- Version: 1.12.7
Additional context
We have a lot of services attached to this Ambassador, and unfortunately that makes the bug hit us harder than it otherwise would. The fact that Ambassador has to load a huge number of certificates doesn’t help, but the security team in the company doesn’t want to hand out wildcards, and we can’t afford to run one Ambassador per team.
Top GitHub Comments
Probably also related to #1661. I know it works by increasing the waiting time, but this shouldn’t be something that has to be done manually (in my case I’m using the chart).
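For chart users, the same relaxation could presumably be expressed as values rather than a manual patch. A sketch, assuming the chart exposes the probe settings under a livenessProbe key; the release name, chart reference, and value keys are placeholders to verify against the chart’s values.yaml:

# Illustrative only; check the chart's values.yaml for the real keys.
helm upgrade ambassador stable/ambassador \
  --reuse-values \
  --set livenessProbe.initialDelaySeconds=60 \
  --set livenessProbe.failureThreshold=10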