Consul namer does not pick up changes in Consul anymore after Consul was unreachable
We are using linkerd with a Consul namer. If there is a short connection error to Consul for whatever reason, linkerd seems to fail to reconnect to Consul and no longer receives any updates from Consul about where a service is available.
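For context, the Consul namer keeps its view fresh via Consul's blocking queries: it long-polls `/v1/health/service/<name>?passing=true` with the last seen `X-Consul-Index` and gets a new snapshot whenever the index changes. A minimal sketch of that pattern, with `fetch()` stubbed in place of the real HTTP call so no Consul agent is needed to run it (the function names are illustrative, not linkerd internals):

```python
def fetch(service, index):
    """Stub for GET /v1/health/service/<service>?passing=true&index=<index>&wait=30s.
    Returns (new X-Consul-Index, list of healthy addresses)."""
    # Canned data: the service is up at first, then has no healthy hosts.
    snapshots = {0: (10, ["127.0.0.1:9012"]), 10: (20, [])}
    return snapshots.get(index, (index, []))

def watch(service, polls):
    """Long-poll `polls` times, yielding each changed address set."""
    index = 0
    for _ in range(polls):
        new_index, addrs = fetch(service, index)
        if new_index != index:   # X-Consul-Index moved: state changed
            index = new_index
            yield addrs

print(list(watch("my-service", polls=3)))
# → [['127.0.0.1:9012'], []]
```

The bug described below is that after a connection error this watch apparently never resumes, so the namer keeps serving its last snapshot.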
How to reproduce:
Tested with linkerd 0.9.0 and consul 0.8.1
Start an arbitrary web service, in this case running on port 9012
Consul Config:
{
  "service": {
    "name": "my-service",
    "port": 9012
  },
  "checks": [
    {
      "name": "my-service-health",
      "http": "http://localhost:9012/health",
      "service_id": "my-service",
      "interval": "5s"
    }
  ]
}
Start consul
consul agent -dev -advertise 127.0.0.1 -config-dir=/path/to/config
Linkerd config:
admin:
  port: 9991

namers:
- kind: io.l5d.consul
  host: localhost
  port: 8500
  useHealthCheck: true

routers:
- protocol: http
  dtab: |
    /svc => /#/io.l5d.consul/dc1;
  label: consul
  servers:
  - port: 4140
    ip: 0.0.0.0
Start linkerd
linkerd ./linkerd.yaml
Make a request to my-service via linkerd:
$ curl http://localhost:4140/ -H 'Host: my-service' -I
HTTP/1.1 200 OK
...
Now kill my-service and make another request for this service to linkerd:
$ curl http://localhost:4140/ -H 'Host: my-service' -I
The request returns a 502.
In the linkerd log you see the expected:
0420 15:30:51.422 UTC THREAD19 TraceId:1f9aa75f534a156b: #/io.l5d.consul/dc1/my-service: name resolution is negative (local dtab: Dtab())
E 0420 15:30:52.245 UTC THREAD22 TraceId:c0f9420907d4ff25: service failure
com.twitter.finagle.NoBrokersAvailableException: No hosts are available for /svc/my-service, Dtab.base=[/svc=>/#/io.l5d.consul/dc1], Dtab.local=[]. Remote Info: Not Available
Start my-service again.
Subsequent requests to my-service via linkerd will of course work again
$ curl http://localhost:4140/ -H 'Host: my-service' -I
HTTP/1.1 200 OK
...
Now stop Consul and start it again.
The linkerd log shows:
Failure(java.net.ConnectException: Connection refused: localhost/127.0.0.1:8500. Remote Info: Upstream Address: /127.0.0.1:51163, Upstream Client Id: Not Available, Downstream Address: localhost/127.0.0.1:8500, Downstream Client Id: #/io.l5d.consul, Trace Id: 73204f8c5072fd90.73204f8c5072fd90<:73204f8c5072fd90, flags=0x10) with NoSources
Caused by: com.twitter.finagle.ChannelWriteException: java.net.ConnectException: Connection refused: localhost/127.0.0.1:8500. Remote Info: Upstream Address: /127.0.0.1:51163, Upstream Client Id: Not Available, Downstream Address: localhost/127.0.0.1:8500, Downstream Client Id: #/io.l5d.consul, Trace Id: 73204f8c5072fd90.73204f8c5072fd90<:73204f8c5072fd90
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:8500
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at com.twitter.finagle.util.ProxyThreadFactory$$anonfun$newProxiedRunnable$1$$anon$1.run(ProxyThreadFactory.scala:19)
at java.lang.Thread.run(Thread.java:745)
Now kill my-service and make another request for this service to linkerd:
$ curl http://localhost:4140/ -H 'Host: my-service' -I
The request returns a 502, but linkerd still tries to connect to port 9012 even though this instance is down in Consul.
From the linkerd log
E 0420 15:11:47.832 UTC THREAD28 TraceId:0a943f473a987f04: service failure
com.twitter.finagle.ChannelWriteException: java.net.ConnectException: Connection refused: /127.0.0.1:9012. Remote Info: Upstream Address: /127.0.0.1:51799, Upstream Client Id: Not Available, Downstream Address: /127.0.0.1:9012, Downstream Client Id: #/io.l5d.consul/dc1/my-service, Trace Id: 0a943f473a987f04.0a943f473a987f04<:0a943f473a987f04
Caused by: java.net.ConnectException: Connection refused: /127.0.0.1:9012
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at com.twitter.finagle.util.ProxyThreadFactory$$anonfun$newProxiedRunnable$1$$anon$1.run(ProxyThreadFactory.scala:19)
at java.lang.Thread.run(Thread.java:745)
If I query Consul for passing service instances, it returns nothing:
curl http://localhost:8500/v1/health/service/my-service\?passing\=true
[]
So after the short Consul restart, linkerd apparently did not reconnect to Consul and did not pick up the change that no host is available anymore for the requested service; instead it still tries to connect using (probably) cached information. We observed the same behavior when adding/removing hosts in Consul.
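This is consistent with the blocking-query watch dying on the connection error while the namer keeps serving its last cached address set. A hypothetical sketch of the recovery behavior one would expect instead: on failure, retry the long-poll and reset the saved index, since a restarted Consul's index can rewind and a stale index must not be reused. Everything here is simulated (`ConsulDown`, `watch_with_retry`, `make_fetch` are illustrative names, not linkerd code):

```python
class ConsulDown(Exception):
    """Stand-in for a connection error to the Consul agent."""

def watch_with_retry(fetch, max_polls):
    """Long-poll up to max_polls times, surviving Consul outages."""
    index = 0
    for _ in range(max_polls):
        try:
            new_index, addrs = fetch(index)
        except ConsulDown:
            index = 0    # reset: indexes may rewind after a restart
            continue     # a real client would also back off here
        if new_index != index:
            index = new_index
            yield addrs  # publish the fresh address set

def make_fetch(responses):
    """Simulated agent: returns canned responses or raises errors."""
    it = iter(responses)
    def fetch(index):
        r = next(it)
        if isinstance(r, Exception):
            raise r
        return r
    return fetch

# Service is up, Consul restarts, fresh snapshot shows no healthy hosts.
fetch = make_fetch([(10, ["127.0.0.1:9012"]), ConsulDown(), (5, [])])
print(list(watch_with_retry(fetch, max_polls=3)))
# → [['127.0.0.1:9012'], []]
```

With this shape, the empty post-restart snapshot replaces the cached `127.0.0.1:9012`, instead of linkerd forwarding to a dead backend as observed above.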
Issue Analytics
- Created 6 years ago
- Comments: 9 (7 by maintainers)
Top GitHub Comments
For the record, v1.0.2 recovers 5 minutes after a Consul restart (whereas previously, recovery never occurred).
@bashofmann metrics from your instance would be helpful in debugging, thanks!