linkerd stops routing HTTP traffic to external DNS names

Issue Type:

Bug report
Feature request

What happened: After some time, linkerd stops routing traffic to external DNS names and logs

“E 0812 09:45:11.218 UTC THREAD11 TraceId:b86af61cbd38ae05: service failure: com.twitter.finagle.naming.buoyant.DynBoundTimeoutException: Exceeded 30.seconds binding timeout while resolving name: /svc/google.com”

and

“I 0812 09:45:34.190 UTC THREAD11: Reaping /svc/google.com”

Routing to internal services works fine.

Additional symptom:

The Delegator web page loads without the DTAB form and shows the message

“The request to namerd has timed out. Please ensure your config is correct and try again.”

Restarting linkerd restores routing to external services and the Delegator web page.

What you expected to happen: linkerd shouldn’t require a restart to keep routing traffic to external services.
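
The symptom can be watched for with a small probe that sends a request through the router and reports whether it came back. This is only a sketch under assumptions: it assumes the router derives /svc/<host> from the Host header, that linkerd is reachable on port 4140 (the port from the router configuration in this report), and `probe` is a hypothetical helper name. The report does not say which status a failed binding produces, so the probe only distinguishes "got a status" from "no response".

```python
import http.client


def probe(proxy_host, proxy_port, target_host, timeout=35.0):
    """Send GET / through the router with the Host header set to target_host.

    Returns the HTTP status code, or None if no response arrived
    (connection refused, timeout, or a malformed reply).
    The default timeout is an assumption: slightly above the 30 s
    binding timeout from the report, so a binding failure can surface first.
    """
    conn = http.client.HTTPConnection(proxy_host, proxy_port, timeout=timeout)
    try:
        # Supplying Host explicitly replaces the header http.client would add.
        conn.request("GET", "/", headers={"Host": target_host})
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()


# Example call (hypothetical address of the linkerd pod or node):
#   status = probe("127.0.0.1", 4140, "google.com")
#   print("no response" if status is None else status)
```

Running the probe on a schedule against both an external name and an internal service would show whether only external names stop resolving, as described above.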

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: linkerd and the affected services are deployed in a k8s cluster. Router configuration:

    - protocol: http
      label: http-outgoing
      maxRequestKB: 20480
      maxResponseKB: 20480
      httpAccessLog: /var/log/linkerd/l5d-http-outgoing-access.log
      client:
        failureAccrual:
          kind: none
      interpreter:
        kind: io.l5d.k8s.configMap
        experimental: true
        name: l5d-dtabs-config
        filename: http-outgoing
        namespace: servicemesh
      servers:
      - port: 4140
        ip: 0.0.0.0
      bindingTimeoutMs: 30000
      bindingCache:
        paths: 100
        trees: 100
        bounds: 100
        clients: 10
        idleTtlSecs: 5

DTAB used:

  http-outgoing: |-
        /ph        => /$/io.buoyant.rinet ;                     # /ph/80/google.com -> /$/io.buoyant.rinet/80/google.com
        /svc       => /ph/80 ;                                  # /svc/google.com -> /ph/80/google.com
        /svc       => /$/io.buoyant.porthostPfx/ph ;            # /svc/google.com:80 -> /ph/80/google.com
        /k8s       => /#/io.l5d.k8s ;                           # /k8s/default/http/foo -> /#/io.l5d.k8s.http/default/http/foo
        /portNsSvc => /#/portNsSvcToK8s ;                       # /portNsSvc/http/default/foo -> /k8s/default/http/foo
        /host      => /portNsSvc/http/default ;                 # /host/foo -> /portNsSvc/http/default/foo
        /host      => /portNsSvc/http ;                         # /host/default/foo -> /portNsSvc/http/default/foo
        /svc       => /$/io.buoyant.http.domainToPathPfx/host ; # /svc/foo.default -> /host/default/foo
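
To make the external-name path concrete, the first two dtab entries can be traced with a simplified model of prefix rewriting. This is a sketch, not Linkerd's delegation code: it assumes later entries are tried first, it ignores fallback between alternatives, and it does not model utility namers such as porthostPfx or domainToPathPfx. The names `DTAB` and `delegate` are hypothetical.

```python
# Only the two rules relevant to /svc/google.com, in the order they appear above.
DTAB = [
    ("/ph", "/$/io.buoyant.rinet"),
    ("/svc", "/ph/80"),
]


def delegate(path, dtab, max_depth=10):
    """Return the list of paths visited while rewriting `path` by prefix."""
    steps = [path]
    for _ in range(max_depth):
        if path.startswith("/$/") or path.startswith("/#/"):
            break  # reached a namer; the real router would resolve addresses here
        # Assumption: entries lower in the dtab take precedence.
        for prefix, dest in reversed(dtab):
            if path == prefix or path.startswith(prefix + "/"):
                path = dest + path[len(prefix):]
                steps.append(path)
                break
        else:
            break  # no rule matched, so the path cannot be rewritten further
    return steps


for step in delegate("/svc/google.com", DTAB):
    print(step)
```

The trace ends at /$/io.buoyant.rinet/80/google.com, which matches the comments on the first two dtab lines; that final name is the one whose binding times out in the log above.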

Environment:

linkerd/namerd version, config files: linkerd 1.6.4; namerd is not used
Platform, version, and config files (Kubernetes, DC/OS, etc): Kops created k8s cluster in AWS (1.11.9, 1.12.7)
Cloud provider or hardware configuration: AWS

Issue Analytics

State:
Created 4 years ago
Reactions: 2
Comments: 20 (14 by maintainers)

Top GitHub Comments

zaharidichev commented, Sep 12, 2019

@valerii-grachov Thanks a lot for the detailed explanation of the issue and the evidence you provided. We are still trying to reproduce the problem deterministically. By the looks of it, a cache that should not evict under any circumstances is evicting one of these watchers and closing it prematurely. Once we manage to reproduce the problem and find out why this is happening, we will push out a fix.

cpretzer commented, Jan 7, 2020

@j0sh3rs I’m going to close this ticket for now. Please reopen it if you get more details about the behavior and how to reproduce it.