[1.0.2] Periodical temporary Linkerd outages when `io.l5d.mesh` is used
We believe this is a bug introduced in version 1.0.2; we cannot reproduce it with version 1.0.0.
Linkerd config
admin:
  port: 9990
namers:
- kind: io.l5d.consul
  useHealthCheck: true
  consistencyMode: stale
  prefix: /consul
routers:
- protocol: http
  label: egress
  maxRequestKB: 51200
  maxResponseKB: 51200
  maxInitialLineKB: 10
  maxHeadersKB: 65
  dstPrefix: /http
  identifier:
  - kind: io.l5d.header.token
    header: Host
  interpreter:
    kind: io.l5d.mesh
    dst: /#/consul/.local/namerd-grpc
    root: /default
    experimental: true
  bindingCache:
    clients: 1000
  servers:
  - port: 4140
    ip: 0.0.0.0
Namerd config
admin:
  port: 9001
storage:
  ...
namers:
  ...
interfaces:
- kind: io.l5d.thriftNameInterpreter
  cache:
    bindingCacheActive: 10000
  ip: 0.0.0.0
  port: 4100
- kind: io.l5d.httpController
  ip: 0.0.0.0
  port: 4180
- kind: io.l5d.mesh
  ip: 0.0.0.0
  port: 4101
Every few hours a Linkerd instance "becomes broken": it can still forward requests to names that were previously resolved through it, but it cannot forward requests to names that are not yet known to it (e.g., when a new service is deployed).
All requests to “new names” result in:
com.twitter.finagle.RequestTimeoutException: exceeded 10.seconds to unspecified while dyn binding /http/new-service-123.acme.co. Remote Info: Not Available
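For reference, here is a minimal sketch (ours, not linkerd source) of how that failing path is derived from the router config above: the `io.l5d.header.token` identifier takes the first token of the `Host` header, the router prepends its `dstPrefix`, and the resulting path is what the `io.l5d.mesh` interpreter asks Namerd to bind.
```scala
// Minimal sketch (not linkerd source) of how the config above turns a request
// into the path shown in the timeout error.
object NameDerivationSketch {
  def main(args: Array[String]): Unit = {
    val dstPrefix  = "/http"                    // router dstPrefix from the config
    val hostHeader = "new-service-123.acme.co"  // Host header of the incoming request

    // io.l5d.header.token uses the first token of the configured header as the name.
    val name = s"$dstPrefix/$hostHeader"        // => /http/new-service-123.acme.co
    println(name)

    // The io.l5d.mesh interpreter then asks Namerd to bind this path under /default.
    // If that stream to Namerd is broken, the bind never completes and the request
    // fails with the RequestTimeoutException shown above.
  }
}
```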
Looking at the logs (Consul log level set to ALL), we see that it receives updates from Consul about services being added or removed, so we assume that Linkerd knows about the healthy Namerd instances.
After a while (we were not able to figure out whether this interval is constant or not), we see a stack trace like this in the logs (root log level set to ALL):
May 25 15:33:52 linkerd01 linkerd: E 0525 19:33:52.348 UTC THREAD278 TraceId:98a06fbda2c87d2e: [C L:/10.10.1.1:46384 R:/10.10.2.2:4101 S:15] unexpected error
May 25 15:33:52 linkerd01 linkerd: com.twitter.finagle.ChannelWriteException: com.twitter.finagle.ChannelClosedException: null at remote address: /10.10.2.2:4101. Remote Info: Not Available from service: io.l5d.mesh. Remote Info:
Upstream Address: /127.0.0.1:34492, Upstream Client Id: Not Available, Downstream Address: /10.10.2.2:4101, Downstream Client Id: io.l5d.mesh, Trace Id: 98a06fbda2c87d2e.dd46fab898edad2f<:98a06fbda2c87d2e
May 25 15:33:52 linkerd01 linkerd: Caused by: com.twitter.finagle.ChannelClosedException: null at remote address: /10.10.2.2:4101. Remote Info: Not Available
May 25 15:33:52 linkerd01 linkerd: at com.twitter.finagle.ChannelException$.apply(Exceptions.scala:253)
May 25 15:33:52 linkerd01 linkerd: at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.operationComplete(ChannelTransport.scala:105)
May 25 15:33:52 linkerd01 linkerd: at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$2.operationComplete(ChannelTransport.scala:102)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:113)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.DefaultChannelPromise.setFailure(DefaultChannelPromise.java:87)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.Http2CodecUtil$SimpleChannelPromiseAggregator.setPromise(Http2CodecUtil.java:395)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.Http2CodecUtil$SimpleChannelPromiseAggregator.doneAllocatingPromises(Http2CodecUtil.java:314)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.DefaultHttp2FrameWriter.writeHeadersInternal(DefaultHttp2FrameWriter.java:484)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.DefaultHttp2FrameWriter.writeHeaders(DefaultHttp2FrameWriter.java:200)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.Http2OutboundFrameLogger.writeHeaders(Http2OutboundFrameLogger.java:60)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:186)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:146)
May 25 15:33:52 linkerd01 linkerd: at io.netty.handler.codec.http2.H2FrameCodec.write(H2FrameCodec.scala:141)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1089)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1136)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1078)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
May 25 15:33:52 linkerd01 linkerd: at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
May 25 15:33:52 linkerd01 linkerd: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
May 25 15:33:52 linkerd01 linkerd: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
May 25 15:33:52 linkerd01 linkerd: at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:24)
May 25 15:33:52 linkerd01 linkerd: at java.lang.Thread.run(Thread.java:745)
May 25 15:33:52 linkerd01 linkerd: Caused by: java.nio.channels.ClosedChannelException
May 25 15:33:52 linkerd01 linkerd: at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
where:
- 10.10.1.1 - linkerd01
- 10.10.2.1 - namerd01
- 10.10.2.2 - namerd02
Once such an exception is logged, Linkerd "recovers" and is again able to resolve names that were not previously known to it.
Top GitHub Comments
Ok, it looks like what’s happening here is that when linkerd’s client cache is full and it receives a request that requires building a new client, it evicts an existing client and tears it down. If the client has an open stream to namerd (as is the case with the io.l5d.mesh api), the stream is closed on the linkerd side.
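As an illustration of that mechanism, here is a toy sketch (our own, not linkerd's actual cache implementation): a bounded client cache that tears down whatever it evicts will reset any streams the evicted client still has open, but only on its own side.
```scala
// Toy sketch (not linkerd's implementation) of the eviction behaviour described:
// a bounded client cache that closes whatever it evicts, even if that client
// still has a long-lived stream open (as the io.l5d.mesh client does to namerd).
import scala.collection.mutable

trait Client { def close(): Unit }

final class ClientCache(maxClients: Int) {
  // Insertion-ordered map, so the oldest client is evicted first.
  private val clients = mutable.LinkedHashMap.empty[String, Client]

  def getOrCreate(dst: String)(mk: => Client): Client =
    clients.getOrElse(dst, {
      if (clients.size >= maxClients) {
        val (evictedDst, evicted) = clients.head
        clients -= evictedDst
        evicted.close() // evicted client's open streams are reset on this side only
      }
      val created = mk
      clients(dst) = created
      created
    })
}
```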
As far as I can tell, the stream is not closed on the namerd side, and namerd continues to send updates on the stream, which look like this:
Once one of these is sent to linkerd after the stream has been closed, the `recv` method is no longer able to successfully receive the frame, and it falls into an infinite loop trying to receive it. That happens here: https://github.com/linkerd/linkerd/blob/master/finagle/h2/src/main/scala/com/twitter/finagle/buoyant/h2/netty4/Netty4StreamTransport.scala#L479
The call to `recvFrame` returns false and prints "remote offer failed" over and over again:
The way in which remote streams are closed changed as part of #1280, and that's evidently where this bug was introduced.
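To make that loop concrete, here is a toy sketch (our own, not the code in Netty4StreamTransport.scala) of the behaviour described: once the stream has been reset locally, the offer behind `recvFrame` always rejects the frame, so the receive loop never makes progress and keeps logging the same message.
```scala
// Toy sketch (not the linkerd implementation) of the failure mode described above:
// once the local side of the stream has been reset, every offer of the incoming
// frame fails and the receive loop spins forever.
import scala.annotation.tailrec

object RecvLoopSketch {
  sealed trait StreamState
  case object Open          extends StreamState
  case object ResetLocally  extends StreamState

  final class RemoteQueue(state: StreamState) {
    // Stand-in for the queue behind recvFrame: it rejects frames once the
    // stream has been closed on the linkerd side.
    def offer(frame: String): Boolean = state == Open
  }

  @tailrec
  def recv(queue: RemoteQueue, frame: String): Unit =
    if (!queue.offer(frame)) {
      println("remote offer failed") // the line repeated endlessly in the logs
      recv(queue, frame)             // never terminates after a local reset
    }

  def main(args: Array[String]): Unit = {
    recv(new RemoteQueue(Open), "update")           // returns immediately
    // recv(new RemoteQueue(ResetLocally), "update") // would loop forever
  }
}
```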
I'm still investigating, but it seems like there are two approaches to fixing this:
I think we should probably do both but will keep poking around.
I am able to reproduce this as well in linkerd 1.0.2, but not in 1.0.0. Sanitized linkerd logs look like this:
etc.