Consul namer does not pick up changes in Consul anymore after Consul was unreachable
We are using linkerd with a Consul namer. If there is a short connection error to Consul for whatever reason, linkerd seems to fail to reconnect to Consul and no longer receives any updates from Consul about where a service is available.
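For context, the Consul namer keeps its view fresh via Consul's blocking queries: it long-polls `/v1/health/service/<name>?passing=true` with the last seen `X-Consul-Index` and gets a new snapshot whenever the index changes. A minimal sketch of that pattern, with `fetch()` stubbed in place of the real HTTP call so no Consul agent is needed to run it (the function names are illustrative, not linkerd internals):

```python
def fetch(service, index):
    """Stub for GET /v1/health/service/<service>?passing=true&index=<index>&wait=30s.
    Returns (new X-Consul-Index, list of healthy addresses)."""
    # Canned data: the service is up at first, then has no healthy hosts.
    snapshots = {0: (10, ["127.0.0.1:9012"]), 10: (20, [])}
    return snapshots.get(index, (index, []))

def watch(service, polls):
    """Long-poll `polls` times, yielding each changed address set."""
    index = 0
    for _ in range(polls):
        new_index, addrs = fetch(service, index)
        if new_index != index:   # X-Consul-Index moved: state changed
            index = new_index
            yield addrs

print(list(watch("my-service", polls=3)))
# → [['127.0.0.1:9012'], []]
```

The bug described below is that after a connection error this watch apparently never resumes, so the namer keeps serving its last snapshot.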
How to reproduce:
Tested with linkerd 0.9.0 and consul 0.8.1
Start an arbitrary web service, in this case running on port 9012
Consul Config:
{
  "service": {
    "name": "my-service",
    "port": 9012
  },
  "checks": [
    {
      "name": "my-service-health",
      "http": "http://localhost:9012/health",
      "service_id": "my-service",
      "interval": "5s"
    }
  ]
}
Start consul
consul agent -dev -advertise 127.0.0.1 -config-dir=/path/to/config
Linkerd config:
admin:
  port: 9991

namers:
- kind: io.l5d.consul
  host: localhost
  port: 8500
  useHealthCheck: true

routers:
- protocol: http
  dtab: |
    /svc => /#/io.l5d.consul/dc1;
  label: consul
  servers:
  - port: 4140
    ip: 0.0.0.0
Start linkerd
linkerd ./linkerd.yaml
Make a request to my-service via linkerd:
$ curl http://localhost:4140/ -H 'Host: my-service' -I
HTTP/1.1 200 OK
...
Now kill my-service and make another request for this service to linkerd:
$ curl http://localhost:4140/ -H 'Host: my-service' -I
The request returns a 502.
In the linkerd log you see the expected:
0420 15:30:51.422 UTC THREAD19 TraceId:1f9aa75f534a156b: #/io.l5d.consul/dc1/my-service: name resolution is negative (local dtab: Dtab())
E 0420 15:30:52.245 UTC THREAD22 TraceId:c0f9420907d4ff25: service failure
com.twitter.finagle.NoBrokersAvailableException: No hosts are available for /svc/my-service, Dtab.base=[/svc=>/#/io.l5d.consul/dc1], Dtab.local=[]. Remote Info: Not Available
Start my-service again.
Subsequent requests to my-service via linkerd will of course work again
$ curl http://localhost:4140/ -H 'Host: my-service' -I
HTTP/1.1 200 OK
...
Now stop Consul and start it again.
The linkerd log shows:
Failure(java.net.ConnectException: Connection refused: localhost/127.0.0.1:8500. Remote Info: Upstream Address: /127.0.0.1:51163, Upstream Client Id: Not Available, Downstream Address: localhost/127.0.0.1:8500, Downstream Client Id: #/io.l5d.consul, Trace Id: 73204f8c5072fd90.73204f8c5072fd90<:73204f8c5072fd90, flags=0x10) with NoSources
Caused by: com.twitter.finagle.ChannelWriteException: java.net.ConnectException: Connection refused: localhost/127.0.0.1:8500. Remote Info: Upstream Address: /127.0.0.1:51163, Upstream Client Id: Not Available, Downstream Address: localhost/127.0.0.1:8500, Downstream Client Id: #/io.l5d.consul, Trace Id: 73204f8c5072fd90.73204f8c5072fd90<:73204f8c5072fd90
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:8500
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at com.twitter.finagle.util.ProxyThreadFactory$$anonfun$newProxiedRunnable$1$$anon$1.run(ProxyThreadFactory.scala:19)
at java.lang.Thread.run(Thread.java:745)
Now kill my-service and make another request for this service to linkerd:
$ curl http://localhost:4140/ -H 'Host: my-service' -I
The request returns a 502, but linkerd still tries to connect to port 9012 even though this instance is down in Consul.
From the linkerd log
E 0420 15:11:47.832 UTC THREAD28 TraceId:0a943f473a987f04: service failure
com.twitter.finagle.ChannelWriteException: java.net.ConnectException: Connection refused: /127.0.0.1:9012. Remote Info: Upstream Address: /127.0.0.1:51799, Upstream Client Id: Not Available, Downstream Address: /127.0.0.1:9012, Downstream Client Id: #/io.l5d.consul/dc1/my-service, Trace Id: 0a943f473a987f04.0a943f473a987f04<:0a943f473a987f04
Caused by: java.net.ConnectException: Connection refused: /127.0.0.1:9012
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at com.twitter.finagle.util.ProxyThreadFactory$$anonfun$newProxiedRunnable$1$$anon$1.run(ProxyThreadFactory.scala:19)
at java.lang.Thread.run(Thread.java:745)
If I query Consul for passing service instances, it returns nothing:
curl http://localhost:8500/v1/health/service/my-service\?passing\=true
[]
So after the short Consul restart, linkerd apparently did not reconnect to Consul and did not pick up the change that no host is available anymore for the requested service; instead it still tries to connect using (probably) cached information. We observed the same behavior when adding/removing hosts in Consul.
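This is consistent with the blocking-query watch dying on the connection error while the namer keeps serving its last cached address set. A hypothetical sketch of the recovery behavior one would expect instead: on failure, retry the long-poll and reset the saved index, since a restarted Consul's index can rewind and a stale index must not be reused. Everything here is simulated (`ConsulDown`, `watch_with_retry`, `make_fetch` are illustrative names, not linkerd code):

```python
class ConsulDown(Exception):
    """Stand-in for a connection error to the Consul agent."""

def watch_with_retry(fetch, max_polls):
    """Long-poll up to max_polls times, surviving Consul outages."""
    index = 0
    for _ in range(max_polls):
        try:
            new_index, addrs = fetch(index)
        except ConsulDown:
            index = 0    # reset: indexes may rewind after a restart
            continue     # a real client would also back off here
        if new_index != index:
            index = new_index
            yield addrs  # publish the fresh address set

def make_fetch(responses):
    """Simulated agent: returns canned responses or raises errors."""
    it = iter(responses)
    def fetch(index):
        r = next(it)
        if isinstance(r, Exception):
            raise r
        return r
    return fetch

# Service is up, Consul restarts, fresh snapshot shows no healthy hosts.
fetch = make_fetch([(10, ["127.0.0.1:9012"]), ConsulDown(), (5, [])])
print(list(watch_with_retry(fetch, max_polls=3)))
# → [['127.0.0.1:9012'], []]
```

With this shape, the empty post-restart snapshot replaces the cached `127.0.0.1:9012`, instead of linkerd forwarding to a dead backend as observed above.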
Issue Analytics
- Created 6 years ago
- Comments: 9 (7 by maintainers)
Top GitHub Comments
For the record, v1.0.2 recovers 5 minutes after a Consul restart (whereas previously, recovery never occurred).
@bashofmann metrics from your instance would be helpful in debugging, thanks!