L5d fails with DynBoundTimeoutException after target service is updated
Issue Type: Bug report
What happened: When a k8s deployment is applied and a new version is rolled out, L5d fails with “E 0720 15:10:25.148 UTC THREAD10: service failure: com.twitter.finagle.naming.buoyant.DynBoundTimeoutException: Exceeded 10.seconds binding timeout while connecting to /#/io.l5d.k8s/default/http/my_service_name for name: /svc/my_service_name”
What you expected to happen: L5d should pick up the new service endpoints and send traffic to the new pods.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: l5d and the target service are deployed in k8s. Server config for the router in use (a fuller router sketch follows after this snippet):
servers:
- port: XXXX
  ip: 0.0.0.0
  clearContext: true
bindingCache:
  paths: 100
  trees: 100
  bounds: 100
  clients: 10
  idleTtlSecs: 5
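For context, here is a minimal sketch of where those blocks sit in a linkerd 1.x router config, assuming the io.l5d.k8s namer and a default-namespace HTTP dtab that would produce the /#/io.l5d.k8s/default/http/my_service_name path seen in the error. The port (4140), label, namer endpoint, and dtab are illustrative assumptions, not taken from the reporter's actual config:

# Hypothetical linkerd 1.x config showing the reported blocks in context.
namers:
- kind: io.l5d.k8s          # resolves /#/io.l5d.k8s/<namespace>/<port>/<service>
  host: localhost
  port: 8001                # assumed local k8s API proxy
routers:
- protocol: http
  label: http
  dtab: |
    /svc => /#/io.l5d.k8s/default/http;   # /svc/my_service_name -> default ns, "http" port
  bindingCache:
    paths: 100
    trees: 100
    bounds: 100
    clients: 10
    idleTtlSecs: 5
  servers:
  - port: 4140              # illustrative; the report uses XXXX
    ip: 0.0.0.0
    clearContext: true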
Rolling update details (a Deployment sketch follows below):
replicas: 2
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 50%
minReadySeconds: 20
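For reference, a minimal sketch of a Deployment carrying those rollout settings; the name, labels, and image are placeholders, not the reporter's actual manifest:

# Hypothetical Deployment showing where the reported rollout settings live.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service            # placeholder name
spec:
  replicas: 2
  minReadySeconds: 20         # new pods must stay Ready this long before counting as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0       # never remove an old pod before a replacement is available
      maxSurge: 50%           # allow up to one extra pod (50% of 2) during the rollout
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: my-service:latest   # placeholder image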
Environment:
- linkerd/namerd version, config files: 1.4.2
- Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes on AWS (not EKS)
- Cloud provider or hardware configuration:
Top GitHub Comments
Update as I started looking into this issue: I can easily reproduce this now. I've confirmed that the K8s endpoint notifications do get sent to Linkerd on pod destroy and create; it's the client_state.json that shows the staleness.
Let me also explain why it's hard to catch this running on, say, your laptop. If you just start destroying pods, they come back up very quickly and reuse the same IP address, so when you try to hit this case it looks like there is no bug, even though client_state.json is technically stale. It wasn't until I modified my deployment in K8s to inject an init container that sleeps for 30s on pod creation (see the sketch below) that the bug surfaced easily. Once it does, it never seems to recover unless you stop all traffic and let the 10-minute idle timeout kick in, which destroys all the state.
Now that this is easily reproducible, I should be able to track down where in the code we're missing this.
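A minimal sketch of the kind of init-container delay described above, merged into a Deployment's pod template; the container name and busybox image are assumptions, not the maintainer's actual change:

# Hypothetical pod template fragment: delay pod startup by 30s so stale
# endpoints stay visible long enough to observe the bug.
spec:
  template:
    spec:
      initContainers:
      - name: startup-delay        # assumed name
        image: busybox:1.36        # assumed image
        command: ["sh", "-c", "sleep 30"]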
@adleong’s fix (v1.5.1-stab) looks promising. It has been deployed in our dev cluster for a day now and we haven’t seen any issues.