HTTP 504 Gateway timeouts after upgrading to linkerd 1.2.0

Filing a linkerd issue

Issue Type: Bug Report

What happened: We have a Rails application that serves web pages, running as a Kubernetes deployment. Every time a deployment is made and new pods come up, linkerd returns 504 Gateway Timeout errors. When I checked the linkerd logs, I could see that it was still making requests to the old endpoints (not sure if this is a config issue). The problem resolves itself after some time.

What you expected to happen: Endpoints should be instantly updated whenever there is a deployment.

How to reproduce it (as minimally and precisely as possible): Run a Rails Puma server that serves web pages, then make requests to the service just after a deployment.
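The reproduction step can be scripted roughly as follows. This is a minimal sketch, not something from the original report: the URL, the `Host` value, and the function names are hypothetical, and it assumes requests go through the `webapp-external` router on port 4143, which routes on the `Host` header. Run it while a deployment is rolling out and look for a burst of 504s.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable


def fetch_status(url: str, host: str, timeout_s: float = 5.0) -> int:
    """Return the HTTP status for one request, or 0 if no HTTP response arrived."""
    # The Host header is set explicitly because the router identifies on it.
    request = urllib.request.Request(url, headers={"Host": host})
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including 504) are raised as HTTPError.
        return err.code
    except OSError:
        # Connection refused, DNS failure, or socket timeout.
        return 0


def count_statuses(fetch: Callable[[], int], attempts: int, delay_s: float = 0.0) -> Counter:
    """Call fetch() repeatedly and tally the status codes it returns."""
    counts: Counter = Counter()
    for _ in range(attempts):
        counts[fetch()] += 1
        if delay_s > 0:
            time.sleep(delay_s)
    return counts
```

For example, `count_statuses(lambda: fetch_status("http://<linkerd-host>:4143/", "webapp.example.com"), attempts=120, delay_s=0.5)` polls for about a minute; both the address and the host name here are placeholders.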

Anything else we need to know?: Our router config:

    - protocol: http
      label: webapp-external
      identifier:
        kind: io.l5d.header.token
        header: Host
      interpreter:
        kind: io.l5d.namerd
        dst: /$/inet/namerd.linkerd.svc.cluster.local/4100
        namespace: external
        transformers:
        - kind: io.l5d.k8s.daemonset
          namespace: linkerd
          port: webapp-ingress
          service: linkerd-internal
      servers:
      - port: 4143
        ip: 0.0.0.0
      client:
        kind: io.l5d.global
        loadBalancer:
          kind: ewma
          enableProbation: false
          maxEffort: 5
          decayTimeMs: 10
        failureAccrual:
          kind: io.l5d.consecutiveFailures
          failures: 5
      service:
        kind: io.l5d.global

    - protocol: http
      label: webapp-ingress
      identifier:
        kind: io.l5d.header.token
        header: Host
      interpreter:
        kind: io.l5d.namerd
        dst: /$/inet/namerd.linkerd.svc.cluster.local/4100
        namespace: external
        transformers:
        - kind: io.l5d.k8s.localnode
      servers:
      - port: 4145
        ip: 0.0.0.0
      client:
        kind: io.l5d.global
        loadBalancer:
          kind: ewma
          enableProbation: false
          maxEffort: 5
          decayTimeMs: 10
        failureAccrual:
          kind: io.l5d.consecutiveFailures
          failures: 5
      service:
        kind: io.l5d.global

Environment:

linkerd/namerd version, config files: 1.2.0/1.2.0
Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes
Cloud provider or hardware configuration: AWS

Issue Analytics

Created 6 years ago
Comments: 50 (25 by maintainers)

Top GitHub Comments

hawkw commented, Sep 20, 2017

@bseibel I’ve been looking into this some more today and I agree that this issue is almost certainly related to kubernetes/kubernetes#35068. That also explains why our unit tests haven’t caught this issue, as the tests for handling the “too old resource version” response set the response status code to 410.
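The mismatch between the tests and the real API server behaviour can be illustrated with a small check. This is a minimal Python sketch, not linkerd's actual (Scala) implementation. It assumes a watch event is a JSON object with `type` and `object` fields, and that an expired version can arrive as an `ERROR` event carrying a status object with `code` 410 or a "too old resource version" message, even when the HTTP status itself is something other than 410 (200 is used below purely as an assumed example). The function name is hypothetical.

```python
from typing import Any, Mapping

GONE = 410  # the HTTP status the unit tests assumed for this error


def is_too_old_resource_version(http_status: int, event: Mapping[str, Any]) -> bool:
    """Return True if a watch response signals an expired resourceVersion."""
    # Case covered by the existing tests: the HTTP status itself is 410.
    if http_status == GONE:
        return True
    # Case the tests missed: the error is only visible in the event body.
    if event.get("type") != "ERROR":
        return False
    status_object = event.get("object") or {}
    message = str(status_object.get("message", "")).lower()
    return status_object.get("code") == GONE or "too old resource version" in message
```

A check that only looks at `http_status` would treat the second case as a normal event, so the watch would never be restarted with a fresh resource version.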

bseibel commented, Oct 3, 2017

Unfortunately, we’re still seeing this issue even with the fix here. We now see this in the debug logs:

D 1003 19:20:25.158 UTC THREAD51 TraceId:9921f7129139749d: k8s returned 'too old resource version' error with incorrect HTTP status code, restarting watch

However, we also see lines like the following (pardon my slightly filtered log lines without the endpoints):

E  D 1003 19:51:13.636 UTC THREAD65 TraceId:8b3991f3b0f04b0d: k8s ns default svc yarisgrmn constructed new ServiceEndpoints with:
E  D 1003 19:51:13.636 UTC THREAD65 TraceId:8b3991f3b0f04b0d: k8s ns default service yarisgrmn added port mappings
E  D 1003 19:51:13.636 UTC THREAD65 TraceId:8b3991f3b0f04b0d: k8s ns default service yarisgrmn added endpoints

These appear for most pre-existing endpoints, which is fine and expected. But a service that was added after the “restarting watch” line doesn’t appear in the logs at all, and linkerd ends up giving us “No hosts are available”.
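One way to narrow this down is to scan the debug logs for services that never show up after a watch restart. The following is a rough Python sketch under stated assumptions: the regular expression is inferred only from the filtered log lines quoted above (`k8s ns <namespace> svc|service <name> ...`), the list of expected services must be supplied by hand, and the function names are hypothetical.

```python
import re
from typing import Iterable, Set, Tuple

# Matches e.g. "k8s ns default svc yarisgrmn constructed ..." and
# "k8s ns default service yarisgrmn added endpoints".
SERVICE_LINE = re.compile(r"k8s ns (\S+) (?:svc|service) (\S+) ")
RESTART_MARKER = "restarting watch"


def services_seen_after_restart(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect (namespace, service) pairs logged after the most recent restart."""
    seen: Set[Tuple[str, str]] = set()
    restarted = False
    for line in lines:
        if RESTART_MARKER in line:
            # Start over: only activity after the latest restart is of interest.
            restarted = True
            seen.clear()
            continue
        if restarted:
            match = SERVICE_LINE.search(line)
            if match:
                seen.add((match.group(1), match.group(2)))
    return seen


def missing_services(
    lines: Iterable[str], expected: Iterable[Tuple[str, str]]
) -> Set[Tuple[str, str]]:
    """Return the expected services that never appear after the last restart."""
    return set(expected) - services_seen_after_restart(lines)
```

Feeding it the namerd debug log together with the services known to exist in the cluster would list the ones that were silently dropped after the restart.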

So far we’ve only seen this happen in production, but it happens pretty frequently, sometimes minutes after we kick our namerd pods. Linkerd isn’t logging any issues about connectivity to namerd. I’m out of town at the moment and will try to narrow down the issue further when I’m back later this week, but if there’s anything specific you would like me to poke at in the meantime, please let me know.