Linkerd continues to talk to old endpoint after Kubernetes deployment
Issue Type:
- Bug report
What happened:
I upgraded our Linkerd from 0.8.6 (I know, ancient, but we had no complaints) to 1.3.3. We are running a Kubernetes cluster, and this Linkerd receives traffic from the Internet via an ELB and routes to the appropriate services in various staging environments. I updated our configuration, and everything appeared to be working fine.
Here is our configuration. This upgraded linkerd is named beyond-thunderdome because the old one was thunderdome, and it all makes perfect sense if you've watched Mad Max:
admin:
  port: 9990
  # because it's getting hit by k8s for health checks
  ip: 0.0.0.0

usage:
  enabled: false

routers:
- label: in-beyond-thunderdome
  protocol: http
  #httpAccessLog: /dev/stdout # not in production, por favor
  servers:
  - port: 8080
    # on purpose, in-pod nginx sidecar will be able to proxy to
    # localhost
    ip: 127.0.0.1
    # this clears out linkerd headers that let you manipulate cluster
    # routing. It is ABSOLUTELY CRITICAL that we don't accept these
    # headers from the internet.
    clearContext: true
  interpreter:
    kind: io.l5d.k8s.configMap
    experimental: true # sure, why not
    name: linkerd-dtabs
    filename: beyond-thunderdome
    namespace: default

namers:
- kind: io.l5d.k8s
  host: 127.0.0.1
  port: 8001

telemetry:
- kind: io.l5d.prometheus
  path: /admin/metrics/prometheus
  prefix: linkerd_
- kind: io.l5d.recentRequests
  sampleRate: 0.0001
  capacity: 20
The slimmed-down dtab ConfigMap looks like this:
kind: ConfigMap
metadata:
  name: linkerd-dtabs
data:
  beyond-thunderdome: |-
    /host/com/example/integrations-dashboard => /#/io.l5d.k8s/integrations-staging/http/integrations-dashboard;
    /svc => /$/io.buoyant.http.domainToPathPfx/host;
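For clarity, here is roughly how I understand that dtab to resolve a request, sketched in Python. This is just an illustration of domainToPathPfx plus the first rule, not Linkerd's actual delegation code:

```python
# Rough illustration (not Linkerd's code) of how the dtab above rewrites a
# request's Host header into a concrete Kubernetes service path.
def delegate(host: str) -> str:
    # /svc => /$/io.buoyant.http.domainToPathPfx/host reverses the domain
    # labels: "integrations-dashboard.example.com" becomes
    # /host/com/example/integrations-dashboard
    path = "/host/" + "/".join(reversed(host.split(".")))
    # The first dtab entry maps that prefix to the io.l5d.k8s namer
    # (namespace / port name / service), which resolves to pod endpoints.
    rules = {
        "/host/com/example/integrations-dashboard":
            "/#/io.l5d.k8s/integrations-staging/http/integrations-dashboard",
    }
    return rules.get(path, path)

print(delegate("integrations-dashboard.example.com"))
# -> /#/io.l5d.k8s/integrations-staging/http/integrations-dashboard
```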
A couple of successful deploys went through after the upgrade, but then I was notified that one of the services was down, returning 502s after a deployment. These are all low-traffic staging environments, so both the linkerd and the target service are scaled to 1. The service is called integrations-dashboard, in the namespace integrations-staging. For the sake of this issue, the address is integrations-dashboard.example.com.
Our logs were full of errors like:
E 1206 01:05:20.259 UTC THREAD28: service failure: Failure(No route to host: /100.96.65.31:3000 at remote address: /100.96.65.31:3000. Remote Info: Not Available, flags=0x09) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /100.96.65.31:3000, Downstream label: #/io.l5d.k8s/integrations-staging/http/integrations-dashboard, Trace Id: d1d6dc90db15ca53.d1d6dc90db15ca53<:d1d6dc90db15ca53
Inspecting the Kubernetes service showed that the address 100.96.65.31 was incorrect. I can't say for sure, but I believe it was the internal address of the service prior to their deployment. Interestingly, if I went to the dtab playground in the Linkerd admin console and put in /svc/integrations-dashboard.example.com, it returned the correct internal address. The Prometheus metrics show a failure accrual removal corresponding to the exact time they performed the deploy.
We tried another deployment and it didn’t fix anything. I enabled verbose logging, deployed again, and saw this:
D 1206 03:04:31.913 UTC THREAD28: k8s ns integrations-staging service integrations-dashboard modified endpoints
D 1206 03:04:49.014 UTC THREAD28: k8s ns integrations-staging service integrations-dashboard modified endpoints
D 1206 03:04:49.015 UTC THREAD28: k8s ns integrations-staging service integrations-dashboard added Endpoint(/100.96.63.61,Some(ip-172-20-100-132.us-west-2.compute.internal))
D 1206 03:04:49.113 UTC THREAD28: k8s ns integrations-staging service integrations-dashboard modified endpoints
D 1206 03:04:49.114 UTC THREAD28: k8s ns integrations-staging service integrations-dashboard removed Endpoint(/100.97.4.48,Some(ip-172-20-61-49.us-west-2.compute.internal))
E 1206 03:05:02.996 UTC THREAD31: service failure: Failure(connection timed out: /100.96.65.31:3000 at remote address: /100.96.65.31:3000. Remote Info: Not Available, flags=0x09) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /100.96.65.31:3000, Downstream label: #/io.l5d.k8s/integrations-staging/http/integrations-dashboard, Trace Id: d1d6dc90db15ca53.d1d6dc90db15ca53<:d1d6dc90db15ca53
I 1206 03:05:10.192 UTC THREAD28: FailureAccrualFactory marking connection to "#/io.l5d.k8s/integrations-staging/http/integrations-dashboard" as dead. Remote Address: Inet(/100.96.65.31:3000,Map(nodeName -> ip-172-20-63-116.us-west-2.compute.internal))
E 1206 03:05:10.194 UTC THREAD28: service failure: Failure(connection timed out: /100.96.65.31:3000 at remote address: /100.96.65.31:3000. Remote Info: Not Available, flags=0x09) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /100.96.65.31:3000, Downstream label: #/io.l5d.k8s/integrations-staging/http/integrations-dashboard, Trace Id: d1d6dc90db15ca53.d1d6dc90db15ca53<:d1d6dc90db15ca53
It seems like it is seeing the changing endpoints correctly, but it’s fixated on talking to the old endpoint and refuses to give up and move on.
I tried to troubleshoot the problem, but eventually, to get the service back up, I just restarted linkerd and everything immediately started working.
What you expected to happen:
I expected it to remove the old endpoint from its list when the Kubernetes service told it to.
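To spell out my mental model, the endpoint bookkeeping I expected is nothing fancier than this (purely illustrative, not Linkerd's implementation; addresses taken from the debug log above):

```python
# Purely illustrative: endpoints reported as removed should never be dialed
# again once the watch delivers the removal.
active = {"100.97.4.48:3000"}   # old pod, before the deploy

def endpoint_added(addr: str) -> None:
    active.add(addr)

def endpoint_removed(addr: str) -> None:
    active.discard(addr)

endpoint_added("100.96.63.61:3000")    # new pod comes up
endpoint_removed("100.97.4.48:3000")   # old pod goes away
print(active)                          # {'100.96.63.61:3000'}
```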
How to reproduce it (as minimally and precisely as possible):
I wish I knew. I will try to reproduce the problem. As I said, a couple of identical deploys have already gone through without any issue.
Anything else we need to know?:
Nope. Thanks for reading this enormous ticket.
Environment:
- Linkerd 1.3.3
- Kubernetes 1.7.2
- Cloud provider or hardware configuration: AWS
Issue Analytics
- Created 6 years ago
- Comments: 60 (43 by maintainers)
Top GitHub Comments
Just wanted to give you all a little update: we’ve finally seen this type of failure in our test cluster, and that’s helped to provide a little more insight into where the bug might be. I’m still investigating the issue, but there has been some progress!
@siggy I don’t, sorry. Since I last commented on this I haven’t had time to revisit it, and we’re continuing to run an older version of Linkerd. Definitely willing to experiment again if a new version has a prospective fix or just additional logging.