Linkerd fails to route to kubernetes service seemingly after persistent k8s watch connection dies
Issue Type:
- Bug report
- Feature request
What happened:
After upgrading to 1.3.0, linkerd will occasionally fail to route to a k8s service after a rolling update. The k8s deployments all specify the following:

```yaml
replicas: <n replicas>
strategy:
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 2
```

to guarantee that there are always n healthy pods running during the rolling update. When the issue occurs, kubectl -n <namespace> get pods shows that the pods are ready. Looking at the logs, linkerd seems to be marking healthy services as dead. It seems to only happen after the persistent watch connection through kubectl proxy dies.
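For context, a minimal Deployment manifest using that update strategy could look roughly like the sketch below; the replica count, image, port, and readiness probe are placeholders rather than details from this report:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: redacted-service-1
  namespace: redacted-namespace-1
spec:
  replicas: 3                    # "<n replicas>" in the report
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never drop below the desired replica count
      maxSurge: 2                # allow up to two extra pods during the rollout
  template:
    metadata:
      labels:
        app: redacted-service-1
    spec:
      containers:
      - name: redacted-service-1
        image: registry.example.com/redacted-service-1:1.2.3   # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:          # pods only count as ready once this probe passes
          httpGet:
            path: /health
            port: 8080
```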
The issue resolves itself after a while (usually minutes?), but causes downtime until it does.
kubectl and TRACE-level linkerd logs: https://gist.github.com/agunnerson-ibm/6384e35b9e1239ae1524e5894b1c8e47
Specifically, the issue was seen at 20:26:23 UTC for redacted-service-1 in redacted-namespace-1.
What you expected to happen:
linkerd should not fail to route to a healthy service.
How to reproduce it (as minimally and precisely as possible):
It seems to happen when several people are deploying and hitting services, but I’ve been unable to reproduce it at will.
Anything else we need to know?:
It’s possible this might have also affected the 1.2.x releases. If it did, it would’ve been masked by other issues we were encountering.
Environment:
- linkerd 1.3.0 in k8s daemonset configuration.
- dtab (a config sketch follows after this list): /k8s => /#/io.l5d.k8s; /ns => /k8s/master; /srv => /ns/http; /svc => /srv;
  Requests to services in other namespaces are done by passing the l5d-dtab: /ns => /k8s/some-namespace header.
- kubernetes 1.6.7 in HA setup with 3 masters. HAProxy sits in front of the masters and will terminate connections after half an hour.
Top GitHub Comments
Just ran into this again. Logs: https://gist.github.com/agunnerson-ibm/344cd7f508bec2e5032e065d4d55f6cc
Requests were failing at around 16:15 UTC (line 26656 in the logs) and recovered at around 16:22 UTC. This is for redacted-service-1 in the master namespace.

Thank you! I'll give it a try in our QA environments on Monday.