Linkerd fails to route to kubernetes service seemingly after persistent k8s watch connection dies
Issue Type:
- Bug report
- Feature request
What happened:
After upgrading to 1.3.0, linkerd will occasionally fail to route to a k8s service after a rolling update. The k8s deployments all specify the following:

```yaml
replicas: <n replicas>
strategy:
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 2
```

to guarantee that there are always n healthy pods running during the rolling update. When the issue occurs, kubectl -n <namespace> get pods shows that the pods are ready. Looking at the logs, linkerd seems to be marking healthy services as dead. It seems to only happen after the persistent watch connection through kubectl proxy dies.
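For context, a minimal Deployment manifest using that update strategy could look roughly like the sketch below; the replica count, image, port, and readiness probe are placeholders rather than details from this report:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: redacted-service-1
  namespace: redacted-namespace-1
spec:
  replicas: 3                    # "<n replicas>" in the report
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never drop below the desired replica count
      maxSurge: 2                # allow up to two extra pods during the rollout
  template:
    metadata:
      labels:
        app: redacted-service-1
    spec:
      containers:
      - name: redacted-service-1
        image: registry.example.com/redacted-service-1:1.2.3   # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:          # pods only count as ready once this probe passes
          httpGet:
            path: /health
            port: 8080
```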
The issue resolves itself after a while (usually minutes?), but causes downtime until it does.
kubectl and TRACE-level linkerd logs: https://gist.github.com/agunnerson-ibm/6384e35b9e1239ae1524e5894b1c8e47
Specifically, the issue was seen at 20:26:23 UTC for redacted-service-1 in redacted-namespace-1.
What you expected to happen:
linkerd should not fail to route to a healthy service.
How to reproduce it (as minimally and precisely as possible):
It seems to happen when several people are deploying and hitting services, but I’ve been unable to reproduce it at will.
Anything else we need to know?:
It’s possible this might have also affected the 1.2.x releases. If it did, it would’ve been masked by other issues we were encountering.
Environment:
- linkerd 1.3.0 in k8s daemonset configuration.
- dtab (a config sketch follows after this list): /k8s => /#/io.l5d.k8s; /ns => /k8s/master; /srv => /ns/http; /svc => /srv;
  Requests to services in other namespaces are done by passing the l5d-dtab: /ns => /k8s/some-namespace header.
- kubernetes 1.6.7 in HA setup with 3 masters. HAProxy sits in front of the masters and will terminate connections after half an hour.
Top GitHub Comments
Just ran into this again. Logs: https://gist.github.com/agunnerson-ibm/344cd7f508bec2e5032e065d4d55f6cc
Requests were failing at around 16:15 UTC (line 26656 in the logs) and recovered at around 16:22 UTC. This is for redacted-service-1 in the master namespace.

Thank you! I'll give it a try in our QA environments on Monday.