Linkerd fails to route to kubernetes service seemingly after persistent k8s watch connection dies

Issue Type:

  • Bug report
  • Feature request

What happened:

After upgrading to 1.3.0, linkerd occasionally fails to route to a k8s service after a rolling update. The k8s deployments all specify the following:

  replicas: <n replicas>
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 2

to guarantee that there are always n healthy pods running during the rolling update. When the issue occurs, kubectl -n <namespace> get pods shows that the pods are ready. Judging by the logs, linkerd seems to be marking healthy services as dead. It seems to happen only after the persistent watch connection with kubectl proxy dies.
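As a rough illustration of why these settings should keep n pods available, the bounds implied by the rolling update strategy can be computed as follows. This is a minimal sketch, not Kubernetes code: the function name `rolling_update_bounds` is hypothetical, it assumes absolute values for maxUnavailable and maxSurge (not percentages), and the replica count of 3 is only an example value.

```python
def rolling_update_bounds(replicas, max_unavailable, max_surge):
    """Return (minimum available pods, maximum total pods) during a rolling update.

    Assumes max_unavailable and max_surge are absolute counts, not percentages.
    """
    min_available = replicas - max_unavailable
    max_total = replicas + max_surge
    return min_available, max_total


# Example with a hypothetical n = 3 and the strategy from the report.
min_available, max_total = rolling_update_bounds(replicas=3, max_unavailable=0, max_surge=2)
print(min_available, max_total)
```

With maxUnavailable set to 0, the minimum number of available pods never drops below the replica count, so a routing failure during the update is not explained by the deployment settings alone.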

The issue resolves itself after a while (usually minutes?), but causes downtime until it does.

kubectl and TRACE-level linkerd logs: https://gist.github.com/agunnerson-ibm/6384e35b9e1239ae1524e5894b1c8e47

Specifically, the issue was seen at 20:26:23 UTC for redacted-service-1 in redacted-namespace-1.

What you expected to happen:

linkerd should not fail to route to a healthy service.

How to reproduce it (as minimally and precisely as possible):

It seems to happen when several people are deploying and hitting services, but I’ve been unable to reproduce it at will.

Anything else we need to know?:

It’s possible this might have also affected the 1.2.x releases. If it did, it would’ve been masked by other issues we were encountering.

Environment:

  • linkerd 1.3.0 in k8s daemonset configuration.
  • dtab:
    /k8s => /#/io.l5d.k8s;
    /ns  => /k8s/master;
    /srv => /ns/http;
    /svc => /srv;
    
    Requests to services in other namespaces are done by passing the l5d-dtab: /ns => /k8s/some-namespace header.
  • kubernetes 1.6.7 in an HA setup with 3 masters. HAProxy sits in front of the masters and terminates connections after half an hour.
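To make the dtab above easier to follow, here is a minimal sketch of how a name such as /svc/foo would be rewritten step by step, and how the l5d-dtab override changes the namespace. It is a deliberately simplified model, not linkerd's implementation: the helpers `parse_dtab` and `delegate` are hypothetical, the service name foo is made up, and the sketch assumes only plain prefix rewrites where the last matching entry wins, with no wildcards, alternation, or fallback to earlier entries.

```python
BASE_DTAB = """
/k8s => /#/io.l5d.k8s;
/ns  => /k8s/master;
/srv => /ns/http;
/svc => /srv;
"""


def parse_dtab(text):
    """Parse 'prefix => destination;' entries into a list of (prefix, destination)."""
    rules = []
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        prefix, dest = (part.strip() for part in entry.split("=>"))
        rules.append((prefix, dest))
    return rules


def delegate(path, rules, max_steps=20):
    """Rewrite path until it reaches a namer path (starting with /#/) or no rule matches."""
    for _ in range(max_steps):
        if path.startswith("/#/"):
            return path
        # Later entries take precedence, so scan from the bottom of the dtab.
        for prefix, dest in reversed(rules):
            if path == prefix or path.startswith(prefix + "/"):
                path = dest + path[len(prefix):]
                break
        else:
            return path  # nothing matched; give up on rewriting
    raise RuntimeError("too many rewrite steps; the dtab may contain a cycle")


base_rules = parse_dtab(BASE_DTAB)
default_result = delegate("/svc/foo", base_rules)

# The per-request override is modelled here as an entry appended to the base dtab.
override_rules = base_rules + parse_dtab("/ns => /k8s/some-namespace;")
override_result = delegate("/svc/foo", override_rules)

print(default_result)
print(override_result)
```

Under these assumptions the default lookup ends in the master namespace, and the override swaps in some-namespace while the rest of the path stays the same.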

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

agunnerson-ibm commented, Oct 10, 2017

Just ran into this again. Logs: https://gist.github.com/agunnerson-ibm/344cd7f508bec2e5032e065d4d55f6cc

Requests were failing at around 16:15 UTC (line 26656 in the logs) and recovered at around 16:22 UTC. This is for redacted-service-1 in the master namespace.

agunnerson-ibm commented, Oct 20, 2017

Thank you! I’ll give it a try in our QA environments on Monday.
