Linkerd continues to talk to old endpoint after Kubernetes deployment
Issue Type:
- Bug report
What happened:
I upgraded our Linkerd from 0.8.6 (I know, ancient, but we had no complaints) to 1.3.3. We are running a Kubernetes cluster, and this Linkerd receives traffic from the Internet via an ELB and routes to the appropriate services in various staging environments. I updated our configuration, and everything appeared to be working fine.
Here is our configuration. This upgraded linkerd is named beyond-thunderdome because the old one was thunderdome, and it all makes perfect sense if you've watched Mad Max:
admin:
  port: 9990
  # because it's getting hit by k8s for health checks
  ip: 0.0.0.0

usage:
  enabled: false

routers:
- label: in-beyond-thunderdome
  protocol: http
  #httpAccessLog: /dev/stdout # not in production, por favor
  servers:
  - port: 8080
    # on purpose, in-pod nginx sidecar will be able to proxy to
    # localhost
    ip: 127.0.0.1
    # this clears out linkerd headers that let you manipulate cluster
    # routing. It is ABSOLUTELY CRITICAL that we don't accept these
    # headers from the internet.
    clearContext: true
  interpreter:
    kind: io.l5d.k8s.configMap
    experimental: true # sure, why not
    name: linkerd-dtabs
    filename: beyond-thunderdome
    namespace: default

namers:
- kind: io.l5d.k8s
  host: 127.0.0.1
  port: 8001

telemetry:
- kind: io.l5d.prometheus
  path: /admin/metrics/prometheus
  prefix: linkerd_
- kind: io.l5d.recentRequests
  sampleRate: 0.0001
  capacity: 20
The slimmed-down dtab ConfigMap looks like this:
kind: ConfigMap
metadata:
  name: linkerd-dtabs
data:
  beyond-thunderdome: |-
    /host/com/example/integrations-dashboard => /#/io.l5d.k8s/integrations-staging/http/integrations-dashboard;
    /svc => /$/io.buoyant.http.domainToPathPfx/host;
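For clarity, here is roughly how I understand that dtab to resolve a request, sketched in Python. This is just an illustration of domainToPathPfx plus the first rule, not Linkerd's actual delegation code:

```python
# Rough illustration (not Linkerd's code) of how the dtab above rewrites a
# request's Host header into a concrete Kubernetes service path.
def delegate(host: str) -> str:
    # /svc => /$/io.buoyant.http.domainToPathPfx/host reverses the domain
    # labels: "integrations-dashboard.example.com" becomes
    # /host/com/example/integrations-dashboard
    path = "/host/" + "/".join(reversed(host.split(".")))
    # The first dtab entry maps that prefix to the io.l5d.k8s namer
    # (namespace / port name / service), which resolves to pod endpoints.
    rules = {
        "/host/com/example/integrations-dashboard":
            "/#/io.l5d.k8s/integrations-staging/http/integrations-dashboard",
    }
    return rules.get(path, path)

print(delegate("integrations-dashboard.example.com"))
# -> /#/io.l5d.k8s/integrations-staging/http/integrations-dashboard
```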
A couple of successful deploys went through after the upgrade, but then I was notified that one of the services was down, returning 502s after a deployment. These are all low-traffic staging environments, so both the linkerd and the target service are scaled to 1. The service is called integrations-dashboard, in the namespace integrations-staging. For the sake of this issue, the address is integrations-dashboard.example.com.
Our logs were full of errors like:
E 1206 01:05:20.259 UTC THREAD28: service failure: Failure(No route to host: /100.96.65.31:3000 at remote address: /100.96.65.31:3000. Remote Info: Not Available, flags=0x09) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /100.96.65.31:3000, Downstream label: #/io.l5d.k8s/integrations-staging/http/integrations-dashboard, Trace Id: d1d6dc90db15ca53.d1d6dc90db15ca53<:d1d6dc90db15ca53
Inspecting the Kubernetes service showed that the address 100.96.65.31 was incorrect. I can't say for sure, but I believe it was the internal address of the service prior to their deployment. Interestingly, if I went to the dtab playground in the Linkerd admin console and put in /svc/integrations-dashboard.example.com, it returned the correct internal address. The Prometheus metrics show a failure accrual removal corresponding to the exact time they performed the deploy.
We tried another deployment and it didn’t fix anything. I enabled verbose logging, deployed again, and saw this:
D 1206 03:04:31.913 UTC THREAD28: k8s ns integrations-staging service integrations-dashboard modified endpoints
D 1206 03:04:49.014 UTC THREAD28: k8s ns integrations-staging service integrations-dashboard modified endpoints
D 1206 03:04:49.015 UTC THREAD28: k8s ns integrations-staging service integrations-dashboard added Endpoint(/100.96.63.61,Some(ip-172-20-100-132.us-west-2.compute.internal))
D 1206 03:04:49.113 UTC THREAD28: k8s ns integrations-staging service integrations-dashboard modified endpoints
D 1206 03:04:49.114 UTC THREAD28: k8s ns integrations-staging service integrations-dashboard removed Endpoint(/100.97.4.48,Some(ip-172-20-61-49.us-west-2.compute.internal))
E 1206 03:05:02.996 UTC THREAD31: service failure: Failure(connection timed out: /100.96.65.31:3000 at remote address: /100.96.65.31:3000. Remote Info: Not Available, flags=0x09) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /100.96.65.31:3000, Downstream label: #/io.l5d.k8s/integrations-staging/http/integrations-dashboard, Trace Id: d1d6dc90db15ca53.d1d6dc90db15ca53<:d1d6dc90db15ca53
I 1206 03:05:10.192 UTC THREAD28: FailureAccrualFactory marking connection to "#/io.l5d.k8s/integrations-staging/http/integrations-dashboard" as dead. Remote Address: Inet(/100.96.65.31:3000,Map(nodeName -> ip-172-20-63-116.us-west-2.compute.internal))
E 1206 03:05:10.194 UTC THREAD28: service failure: Failure(connection timed out: /100.96.65.31:3000 at remote address: /100.96.65.31:3000. Remote Info: Not Available, flags=0x09) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /100.96.65.31:3000, Downstream label: #/io.l5d.k8s/integrations-staging/http/integrations-dashboard, Trace Id: d1d6dc90db15ca53.d1d6dc90db15ca53<:d1d6dc90db15ca53
It seems like it is seeing the changing endpoints correctly, but it’s fixated on talking to the old endpoint and refuses to give up and move on.
I tried to troubleshoot the problem, but eventually, to get the service back up, I just restarted linkerd and everything immediately started working.
What you expected to happen:
I expected it to remove the old endpoint from its list when the Kubernetes service told it to.
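To spell out my mental model, the endpoint bookkeeping I expected is nothing fancier than this (purely illustrative, not Linkerd's implementation; addresses taken from the debug log above):

```python
# Purely illustrative: endpoints reported as removed should never be dialed
# again once the watch delivers the removal.
active = {"100.97.4.48:3000"}   # old pod, before the deploy

def endpoint_added(addr: str) -> None:
    active.add(addr)

def endpoint_removed(addr: str) -> None:
    active.discard(addr)

endpoint_added("100.96.63.61:3000")    # new pod comes up
endpoint_removed("100.97.4.48:3000")   # old pod goes away
print(active)                          # {'100.96.63.61:3000'}
```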
How to reproduce it (as minimally and precisely as possible):
I wish I knew. I will try to reproduce the problem. As I said, a couple of identical deploys have already gone through without any issue.
Anything else we need to know?:
Nope. Thanks for reading this enormous ticket.
Environment:
- Linkerd 1.3.3
- Kubernetes 1.7.2
- Cloud provider or hardware configuration: AWS
Issue Analytics
- Created 6 years ago
- Comments: 60 (43 by maintainers)
Top GitHub Comments
Just wanted to give you all a little update: we’ve finally seen this type of failure in our test cluster, and that’s helped to provide a little more insight into where the bug might be. I’m still investigating the issue, but there has been some progress!
@siggy I don’t, sorry. Since I last commented on this I haven’t had time to revisit it, and we’re continuing to run an older version of Linkerd. Definitely willing to experiment again if a new version has a prospective fix or just additional logging.