
[1.0.2] Periodical temporary Linkerd outages when `io.l5d.mesh` is used


We believe it’s a bug introduced in version 1.0.2 - we cannot reproduce it with version 1.0.0.

Linkerd config
admin:
  port: 9990

namers:
- kind: io.l5d.consul
  useHealthCheck: true
  consistencyMode: stale
  prefix: /consul

routers:
- protocol: http
  label: egress
  maxRequestKB: 51200
  maxResponseKB: 51200
  maxInitialLineKB: 10
  maxHeadersKB: 65
  dstPrefix: /http
  identifier:
    - kind: io.l5d.header.token
      header: Host
  interpreter:
    kind: io.l5d.mesh
    dst: /#/consul/.local/namerd-grpc
    root: /default
    experimental: true
  bindingCache:
    clients: 1000
  servers:
  - port: 4140
    ip: 0.0.0.0
Namerd config
admin:
  port: 9001
storage:
  ...
namers:
  ...
interfaces:
- kind: io.l5d.thriftNameInterpreter
  cache:
    bindingCacheActive: 10000
  ip: 0.0.0.0
  port: 4100
- kind: io.l5d.httpController
  ip: 0.0.0.0
  port: 4180
- kind: io.l5d.mesh
  ip: 0.0.0.0
  port: 4101

Every few hours the Linkerd instance “becomes broken”. It can still forward requests to names that were previously resolved through it, but cannot forward requests to names that are not yet known to it (e.g., when a new service is deployed).

All requests to “new names” result in:

com.twitter.finagle.RequestTimeoutException: exceeded 10.seconds to unspecified while dyn binding /http/new-service-123.acme.co. Remote Info: Not Available

Looking at the logs (Consul log level set to ALL), we see that Linkerd receives updates from Consul about services being added or removed, so we assume that it knows about the healthy Namerd instances.

After a while (we weren’t able to figure out whether this interval is constant or not) we see a stacktrace like this in the logs (root log level set to ALL):

May 25 15:33:52 linkerd01 linkerd: E 0525 19:33:52.348 UTC THREAD278 TraceId:98a06fbda2c87d2e: [C L:/10.10.1.1:46384 R:/10.10.2.2:4101 S:15] unexpected error
May 25 15:33:52 linkerd01 linkerd: com.twitter.finagle.ChannelWriteException: com.twitter.finagle.ChannelClosedException: null at remote address: /10.10.2.2:4101. Remote Info: Not Available from service: io.l5d.mesh. Remote Info:
Upstream Address: /127.0.0.1:34492, Upstream Client Id: Not Available, Downstream Address: /10.10.2.2:4101, Downstream Client Id: io.l5d.mesh, Trace Id: 98a06fbda2c87d2e.dd46fab898edad2f<:98a06fbda2c87d2e
May 25 15:33:52 linkerd01 linkerd: Caused by: com.twitter.finagle.ChannelClosedException: null at remote address: /10.10.2.2:4101. Remote Info: Not Available
May 25 15:33:52 linkerd01 linkerd: at com.twitter.finagle.ChannelException$.apply(Exceptions.scala:253)
May 25 15:33:52 linkerd01 linkerd: at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.operationComplete(ChannelTransport.scala:105)
May 25 15:33:52 linkerd01 linkerd: at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.operationComplete(ChannelTransport.scala:102)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:113)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.DefaultChannelPromise.setFailure(DefaultChannelPromise.java:87)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.Http2CodecUtil$SimpleChannelPromiseAggregator.setPromise(Http2CodecUtil.java:395)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.Http2CodecUtil$SimpleChannelPromiseAggregator.doneAllocatingPromises(Http2CodecUtil.java:314)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.DefaultHttp2FrameWriter.writeHeadersInternal(DefaultHttp2FrameWriter.java:484)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.DefaultHttp2FrameWriter.writeHeaders(DefaultHttp2FrameWriter.java:200)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.Http2OutboundFrameLogger.writeHeaders(Http2OutboundFrameLogger.java:60)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:186)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:146)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.H2FrameCodec.write(H2FrameCodec.scala:141)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1089)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1136)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1078)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
May 25 15:33:52 linkerd01 linkerd: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
May 25 15:33:52 linkerd01 linkerd: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
May 25 15:33:52 linkerd01 linkerd: at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:24)
May 25 15:33:52 linkerd01 linkerd: at java.lang.Thread.run(Thread.java:745)
May 25 15:33:52 linkerd01 linkerd: Caused by: java.nio.channels.ClosedChannelException
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

where:

  • 10.10.1.1 - linkerd01
  • 10.10.2.1 - namerd01
  • 10.10.2.2 - namerd02

Once such an exception is logged, Linkerd “recovers” and is once again able to resolve names not previously known to it.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 3
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

2 reactions
klingerf commented, Jun 8, 2017

Ok, it looks like what’s happening here is that when linkerd’s client cache is full and it receives a request that requires building a new client, it evicts an existing client and tears it down. If the client has an open stream to namerd (as is the case with the io.l5d.mesh api), the stream is closed on the linkerd side.
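The eviction behavior described above can be sketched as a toy model (illustrative only, not linkerd’s actual implementation; all names here are invented): a bounded LRU cache that closes a client on eviction, taking down any stream the client still holds open.

```python
from collections import OrderedDict

class Client:
    """Stand-in for a linkerd client holding an open stream to namerd."""
    def __init__(self, dst):
        self.dst = dst
        self.stream_open = True  # models the open io.l5d.mesh stream

    def close(self):
        # Tearing down the client closes its stream on the local side only;
        # per the analysis above, the remote (namerd) side keeps sending.
        self.stream_open = False

class ClientCache:
    """Bounded LRU cache of clients, closing the evicted one (cf. bindingCache)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.clients = OrderedDict()

    def get(self, dst):
        if dst in self.clients:
            self.clients.move_to_end(dst)  # mark as recently used
            return self.clients[dst]
        if len(self.clients) >= self.capacity:
            _, evicted = self.clients.popitem(last=False)  # evict LRU entry
            evicted.close()  # this is the teardown that closes the stream
        client = Client(dst)
        self.clients[dst] = client
        return client
```

With `bindingCache: clients: 1000` as in the config above, this eviction would kick in once the 1001st distinct destination is requested.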

As far as I can tell, the stream is not closed on the namerd side, and namerd continues to send updates on the stream, which look like this:

----------------INBOUND--------------------
[id: 0xe254a629, L:/127.0.0.1:64150 - R:/127.0.0.1:5003] DATA: streamId=5, padding=0, endStream=false, length=22, bytes=0000000011220f0a0d080012047f000001188e272200
------------------------------------

Once one of these is sent to linkerd after the stream has been closed, the recv method is no longer able to successfully receive the frame, and it falls into an infinite loop trying to receive it. That happens here:

https://github.com/linkerd/linkerd/blob/master/finagle/h2/src/main/scala/com/twitter/finagle/buoyant/h2/netty4/Netty4StreamTransport.scala#L479

The call to recvFrame returns false and prints “remote offer failed” over and over again:

D 0608 01:31:28.714 UTC THREAD88: [C L:/127.0.0.1:64018 R:/127.0.0.1:5003 S:5] remote offer failed
D 0608 01:31:28.714 UTC THREAD88: [C L:/127.0.0.1:64018 R:/127.0.0.1:5003 S:5] remote offer failed
D 0608 01:31:28.714 UTC THREAD88: [C L:/127.0.0.1:64018 R:/127.0.0.1:5003 S:5] remote offer failed

The way in which remote streams are closed changed as part of #1280, and that’s evidently where this bug was introduced.

Am still investigating, but it seems like there are two approaches to fixing:

  • Figure out why the remote stream on namerd’s side is not being closed when linkerd tears down the client and closes the stream.
  • Automatically close remote streams when the remote offer fails, rather than retrying indefinitely.

I think we should probably do both but will keep poking around.
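The second approach can be contrasted with the buggy behavior in a small sketch (hypothetical code; names and structure are invented, not linkerd’s actual classes):

```python
class RemoteStream:
    """Toy model of an H2 remote stream whose local side may be torn down."""
    def __init__(self):
        self.reset = False
        self.frames = []

    def offer(self, frame):
        # Once the local side is gone, offers are rejected; this is the
        # "remote offer failed" condition seen in the debug logs above.
        if self.reset:
            return False
        self.frames.append(frame)
        return True

def recv_retry(stream, frame, max_spins=1000):
    """Buggy shape: retry a failed offer indefinitely (bounded here so it halts)."""
    spins = 0
    while not stream.offer(frame):
        spins += 1
        if spins >= max_spins:
            return spins  # the real bug had no such escape hatch
    return spins

def recv_or_reset(stream, frame):
    """Fixed shape: if the offer fails, reset the stream instead of retrying."""
    if stream.offer(frame):
        return True
    stream.reset = True  # stand-in for resetting the stream toward the remote
    return False
```

In the fixed shape, a single failed offer resets the stream, so namerd would stop receiving acknowledgments and could close its side, rather than the frame spinning in a retry loop.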

1 reaction
DukeyToo commented, Jun 6, 2017

I am able to reproduce this as well with linkerd 1.0.2, but not with 1.0.0. Sanitized linkerd logs look like this:

Jun  3 00:46:06 cwl-mesos-minions-v019-1343 linkerd: com.twitter.finagle.RequestTimeoutException: exceeded 10.seconds to unspecified while dyn binding /http/something. Remote Info: Not Available
Jun  3 00:46:06 cwl-mesos-minions-v019-1343 linkerd: E 0603 00:46:06.705 UTC THREAD10 TraceId:15a4bd604b272638: service failure
Jun  3 00:46:13 cwl-mesos-minions-v019-1343 linkerd: E 0603 00:46:13.686 UTC THREAD10 TraceId:aeb9dffa88f5b1e6: service failure
Jun  3 00:46:13 cwl-mesos-minions-v019-1343 linkerd: com.twitter.finagle.RequestTimeoutException: exceeded 10.seconds to unspecified while dyn binding /http/something. Remote Info: Not Available
Jun  3 00:46:43 cwl-mesos-minions-v019-1343 linkerd: E 0603 00:46:43.875 UTC THREAD10 TraceId:ab394c9057dc9fdc: service failure
Jun  3 00:46:43 cwl-mesos-minions-v019-1343 linkerd: com.twitter.finagle.RequestTimeoutException: exceeded 10.seconds to unspecified while dyn binding /http/something. Remote Info: Not Available
Jun  3 00:47:42 cwl-mesos-minions-v019-1343 linkerd: I 0603 00:47:42.929 UTC THREAD10 TraceId:c73fac633d696d92: FailureAccrualFactory marking connection to "io.l5d.mesh" as dead. Remote Address: Inet(cwl-mesos-masters.service.consul/172.21.0.21:4182,Map())
Jun  3 00:47:42 cwl-mesos-minions-v019-1343 linkerd: W 0603 00:47:42.930 UTC THREAD10 TraceId:c73fac633d696d92: Exception propagated to the default monitor (upstream address: /172.21.0.32:15090, downstream address: namerd.service.consul/172.21.0.21:4182, label: io.l5d.mesh).
Jun  3 00:47:42 cwl-mesos-minions-v019-1343 linkerd: Reset.Cancel
Jun  3 00:47:43 cwl-mesos-minions-v019-1343 linkerd: E 0603 00:47:43.685 UTC THREAD10 TraceId:2a0e066147086bb4: service failure

etc.
