L5d fails with DynBoundTimeoutException after target service is updated
Issue Type: Bug report
What happened: When a k8s deployment is applied and a new version is rolled out, L5d fails with “E 0720 15:10:25.148 UTC THREAD10: service failure: com.twitter.finagle.naming.buoyant.DynBoundTimeoutException: Exceeded 10.seconds binding timeout while connecting to /#/io.l5d.k8s/default/http/my_service_name for name: /svc/my_service_name”
What you expected to happen: L5d should pick up the new service endpoints and send traffic to the new pods.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: l5d and the target service are deployed in k8s. Server config for the router in use (a fuller router sketch follows after this snippet):
servers:
- port: XXXX
  ip: 0.0.0.0
  clearContext: true
bindingCache:
  paths: 100
  trees: 100
  bounds: 100
  clients: 10
  idleTtlSecs: 5
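For context, here is a minimal sketch of where those blocks sit in a linkerd 1.x router config, assuming the io.l5d.k8s namer and a default-namespace HTTP dtab that would produce the /#/io.l5d.k8s/default/http/my_service_name path seen in the error. The port (4140), label, namer endpoint, and dtab are illustrative assumptions, not taken from the reporter's actual config:

# Hypothetical linkerd 1.x config showing the reported blocks in context.
namers:
- kind: io.l5d.k8s          # resolves /#/io.l5d.k8s/<namespace>/<port>/<service>
  host: localhost
  port: 8001                # assumed local k8s API proxy
routers:
- protocol: http
  label: http
  dtab: |
    /svc => /#/io.l5d.k8s/default/http;   # /svc/my_service_name -> default ns, "http" port
  bindingCache:
    paths: 100
    trees: 100
    bounds: 100
    clients: 10
    idleTtlSecs: 5
  servers:
  - port: 4140              # illustrative; the report uses XXXX
    ip: 0.0.0.0
    clearContext: true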
Rolling update details (a Deployment sketch follows below):
replicas: 2
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 50%
minReadySeconds: 20
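For reference, a minimal sketch of a Deployment carrying those rollout settings; the name, labels, and image are placeholders, not the reporter's actual manifest:

# Hypothetical Deployment showing where the reported rollout settings live.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service            # placeholder name
spec:
  replicas: 2
  minReadySeconds: 20         # new pods must stay Ready this long before counting as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0       # never remove an old pod before a replacement is available
      maxSurge: 50%           # allow up to one extra pod (50% of 2) during the rollout
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: my-service:latest   # placeholder image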
Environment:
- linkerd/namerd version, config files: 1.4.2
- Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes on AWS (not EKS)
- Cloud provider or hardware configuration:
Top GitHub Comments
Update as I started looking into this issue: I can easily reproduce this now. I've confirmed that the K8s endpoint notifications do get sent to Linkerd on pod destroy and create; it's the client_state.json that shows the staleness.
Let me also explain why it's hard to catch this running on, say, your laptop. If you just start destroying pods, they come back up very quickly and reuse the same IP address, so when you try to hit this case it looks like there is no bug, even though client_state.json is technically stale. It wasn't until I modified my deployment in K8s to inject an init container that sleeps for 30s on pod creation (see the sketch below) that the bug surfaced easily. Once it does, it never seems to recover unless you stop all traffic and let the 10-minute idle timeout kick in, which destroys all the state.
Now that this is easily reproducible, I should be able to track down where in the code we're missing this.
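A minimal sketch of the kind of init-container delay described above, merged into a Deployment's pod template; the container name and busybox image are assumptions, not the maintainer's actual change:

# Hypothetical pod template fragment: delay pod startup by 30s so stale
# endpoints stay visible long enough to observe the bug.
spec:
  template:
    spec:
      initContainers:
      - name: startup-delay        # assumed name
        image: busybox:1.36        # assumed image
        command: ["sh", "-c", "sleep 30"]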
@adleong’s fix (v1.5.1-stab) looks promising. It has been deployed in our dev cluster for a day now and we haven’t seen any issues.