Linkerd stops forwarding gRPC traffic
Issue Type:
- Bug report
What happened:
After some time, linkerd stops forwarding traffic to a service. Restarting the affected linkerd instance restores traffic flow. Bypassing linkerd and making a request directly to the service works.
Here’s a netty frame dump for a failing client request (the l5d-dtab header is there to force linkerd to forward to the failing instance):
2018-09-03 17:01:11,191 DEBUG io.grpc.netty.NettyClientHandler [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] OUTBOUND SETTINGS: ack=false settings={ENABLE_PUSH=0, MAX_CONCURRENT_STREAMS=0, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=8192}
2018-09-03 17:01:11,223 DEBUG io.grpc.netty.NettyClientHandler [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] OUTBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=983041
2018-09-03 17:01:11,295 DEBUG io.grpc.netty.NettyClientHandler [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] INBOUND SETTINGS: ack=false settings={INITIAL_WINDOW_SIZE=1048576, MAX_FRAME_SIZE=4194304}
2018-09-03 17:01:11,298 DEBUG io.grpc.netty.NettyClientHandler [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] OUTBOUND SETTINGS: ack=true
2018-09-03 17:01:11,300 DEBUG io.grpc.netty.NettyClientHandler [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] INBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=1966082
2018-09-03 17:01:11,300 DEBUG io.grpc.netty.NettyClientHandler [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] INBOUND SETTINGS: ack=true
2018-09-03 17:01:11,352 DEBUG io.grpc.netty.NettyClientHandler [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] OUTBOUND HEADERS: streamId=3 headers=GrpcHttp2OutboundHeaders[:authority: 127.0.0.1:41422, :path: /bigcommerce.rpc.storeconfig.StoreConfig/GetStore, :method: POST, :scheme: http, content-type: application/grpc, te: trailers, user-agent: grpc-java-netty/1.11.0, l5d-dtab: /svc/* => /$/inet/10.171.25.200/4143, grpc-accept-encoding: gzip, grpc-trace-bin: ] streamDependency=0 weight=16 exclusive=false padding=0 endStream=false
2018-09-03 17:01:11,368 DEBUG io.grpc.netty.NettyClientHandler [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] OUTBOUND DATA: streamId=3 padding=0 endStream=true length=10 bytes=000000000508c8c4e805
2018-09-03 17:01:12,027 DEBUG io.grpc.netty.NettyClientHandler [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] INBOUND RST_STREAM: streamId=3 errorCode=8
[error] (run-main-2) io.grpc.StatusRuntimeException: CANCELLED: HTTP/2 error code: CANCEL
[error] Received Rst Stream
io.grpc.StatusRuntimeException: CANCELLED: HTTP/2 error code: CANCEL
Received Rst Stream
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:101)
at com.bigcommerce.storeconfig.StoreConfigGrpc$StoreConfigBlockingStub.getStore(StoreConfigGrpc.scala:132)
at com.bigcommerce.storeconfig.TestApp$.delayedEndpoint$com$bigcommerce$storeconfig$TestApp$1(TestApp.scala:35)
at com.bigcommerce.storeconfig.TestApp$delayedInit$body.apply(TestApp.scala:7)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.bigcommerce.storeconfig.TestApp$.main(TestApp.scala:7)
at com.bigcommerce.storeconfig.TestApp.main(TestApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
[trace] Stack trace suppressed: run last compile:run for the full output.
2018-09-03 17:01:12,041 DEBUG io.grpc.netty.NettyClientHandler [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] OUTBOUND RST_STREAM: streamId=3 errorCode=8
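For reference, the 10-byte OUTBOUND DATA payload above is a standard gRPC length-prefixed message: a 1-byte compressed flag, a 4-byte big-endian length, then the protobuf-encoded request. A minimal Python sketch decodes it; note that interpreting the single varint field as the store id is a guess without the `.proto` on hand:

```python
import struct

def parse_grpc_frame(frame: bytes):
    """Split a gRPC length-prefixed message into (compressed flag, payload)."""
    compressed, length = struct.unpack(">BI", frame[:5])
    payload = frame[5:5 + length]
    assert len(payload) == length, "truncated frame"
    return compressed, payload

def read_varint(buf: bytes, pos: int = 0):
    """Decode a protobuf base-128 varint starting at pos; return (value, next pos)."""
    shift, value = 0, 0
    while True:
        b = buf[pos]
        value |= (b & 0x7F) << shift
        pos += 1
        if not b & 0x80:
            return value, pos
        shift += 7

# The exact bytes from the OUTBOUND DATA frame above.
frame = bytes.fromhex("000000000508c8c4e805")
compressed, payload = parse_grpc_frame(frame)
# 0x08 is a protobuf key: field number 1, wire type 0 (varint).
key, pos = read_varint(payload)
value, _ = read_varint(payload, pos)
```

So the request is uncompressed, 5 bytes long, and carries a single varint field — i.e. a perfectly ordinary unary request, which supports the conclusion that the CANCEL is not the client's doing.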
Here’s a TRACE request that shows something similar:
[production][dal10sl][root@store-app82-p]:~# nghttp -v -H "l5d-add-context: true" -H "l5d-dtab: /svc/* => /$/inet/10.171.25.200/4143;" -H ":method: TRACE" -H "max-forwards: 2" -H ":path: /bigcommerce.rpc.storeconfig.StoreConfig/GetStore" http://linkerd:4142
[ 0.003] Connected
[ 0.003] send SETTINGS frame <length=12, flags=0x00, stream_id=0>
(niv=2)
[SETTINGS_MAX_CONCURRENT_STREAMS(0x03):100]
[SETTINGS_INITIAL_WINDOW_SIZE(0x04):65535]
[ 0.003] send PRIORITY frame <length=5, flags=0x00, stream_id=3>
(dep_stream_id=0, weight=201, exclusive=0)
[ 0.003] send PRIORITY frame <length=5, flags=0x00, stream_id=5>
(dep_stream_id=0, weight=101, exclusive=0)
[ 0.003] send PRIORITY frame <length=5, flags=0x00, stream_id=7>
(dep_stream_id=0, weight=1, exclusive=0)
[ 0.003] send PRIORITY frame <length=5, flags=0x00, stream_id=9>
(dep_stream_id=7, weight=1, exclusive=0)
[ 0.003] send PRIORITY frame <length=5, flags=0x00, stream_id=11>
(dep_stream_id=3, weight=1, exclusive=0)
[ 0.003] send HEADERS frame <length=137, flags=0x25, stream_id=13>
; END_STREAM | END_HEADERS | PRIORITY
(padlen=0, dep_stream_id=11, weight=16, exclusive=0)
; Open new stream
:method: TRACE
:path: /bigcommerce.rpc.storeconfig.StoreConfig/GetStore
:scheme: http
:authority: linkerd:4142
accept: */*
accept-encoding: gzip, deflate
user-agent: nghttp2/1.31.0
l5d-add-context: true
l5d-dtab: /svc/* => /$/inet/10.171.25.200/4143;
max-forwards: 2
[ 0.005] recv SETTINGS frame <length=12, flags=0x00, stream_id=0>
(niv=2)
[SETTINGS_INITIAL_WINDOW_SIZE(0x04):1048576]
[SETTINGS_MAX_FRAME_SIZE(0x05):4194304]
[ 0.005] recv WINDOW_UPDATE frame <length=4, flags=0x00, stream_id=0>
(window_size_increment=1966082)
[ 0.005] send SETTINGS frame <length=0, flags=0x01, stream_id=0>
; ACK
(niv=0)
[ 0.040] recv SETTINGS frame <length=0, flags=0x01, stream_id=0>
; ACK
(niv=0)
[ 0.367] recv RST_STREAM frame <length=4, flags=0x00, stream_id=13>
(error_code=CANCEL(0x08))
[ 0.367] send GOAWAY frame <length=8, flags=0x00, stream_id=0>
(last_stream_id=0, error_code=NO_ERROR(0x00), opaque_data(0)=[])
Some requests were not processed. total=1, processed=0
Linkerd appears to immediately respond with an H2 reset frame with error code 8 (CANCEL). Neither client I’ve written to test linkerd in this state is issuing a cancel, so I have to assume it’s originating from somewhere inside linkerd.
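The numeric code in both frame dumps maps to the HTTP/2 error-code registry in RFC 7540 §7; a quick lookup confirms 0x8 is CANCEL, which the spec defines as "the stream is no longer needed":

```python
# HTTP/2 error codes, per RFC 7540 section 7.
H2_ERROR_CODES = {
    0x0: "NO_ERROR",
    0x1: "PROTOCOL_ERROR",
    0x2: "INTERNAL_ERROR",
    0x3: "FLOW_CONTROL_ERROR",
    0x4: "SETTINGS_TIMEOUT",
    0x5: "STREAM_CLOSED",
    0x6: "FRAME_SIZE_ERROR",
    0x7: "REFUSED_STREAM",
    0x8: "CANCEL",
    0x9: "COMPRESSION_ERROR",
    0xA: "CONNECT_ERROR",
    0xB: "ENHANCE_YOUR_CALM",
    0xC: "INADEQUATE_SECURITY",
    0xD: "HTTP_1_1_REQUIRED",
}

# The RST_STREAM frames above carry errorCode=8.
print(H2_ERROR_CODES[0x8])  # CANCEL
```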
The logs are flooded with this message (which probably should be improved):
Sep 03 19:33:02 nomad-client11-p.dal10sl.bigcommerce.net linkerd[2801]: W 0903 19:33:02.407 CDT THREAD34 TraceId:6eb223c0216a9fc6: Exception propagated to the default monitor (upstream address: /172.17.0.14:36590, downstream address: /10.171.25.200:4143, label: %/io.l5d.port/4143/#/io.l5d.consul/.local/storeconfig).
Sep 03 19:33:02 nomad-client11-p.dal10sl.bigcommerce.net linkerd[2801]: Reset.Cancel
The /client_state.json endpoint reports correct and up-to-date service discovery information.
A cursory glance at a thread dump did not seem to indicate any deadlocked threads.
What you expected to happen:
Linkerd to forward traffic.
How to reproduce it (as minimally and precisely as possible):
Unknown as of yet.
Issue Analytics
- Created 5 years ago
- Comments: 6 (6 by maintainers)
Top GitHub Comments
@zackangelo Thanks for filing this issue. It looks like that diag trace may give us a clue as to where linkerd might be inappropriately sending a Reset.Cancel. We will dig into this.

An update here: we stopped seeing this issue and haven’t seen any new log messages.
Our environment is pretty dynamic and changing all the time, but one important change we made was removing from the linkerd path a service that was creating HTTP requests and then force-closing the socket.
I’m going to close for now. If I can reproduce this issue by recreating the bad service’s behavior, I’ll reopen with more detail.
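For anyone wanting to attempt that repro, the force-close can be simulated with SO_LINGER set to a zero timeout, which makes close() send a TCP RST instead of a normal FIN. A self-contained sketch follows; the local listening socket stands in for linkerd, and this is only an assumption about how our misbehaving service closed its connections, not a confirmed repro:

```python
import socket
import struct

def abortive_close_demo() -> str:
    """Send a partial HTTP request, then force-close the socket with an RST."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))     # stand-in for the proxy listener
    srv.listen(1)
    cli = socket.socket()
    cli.connect(srv.getsockname())
    conn, _ = srv.accept()
    cli.sendall(b"POST /svc HTTP/1.1\r\nHost: example\r\n")  # never finished
    # SO_LINGER with a zero timeout turns close() into an abortive close (RST).
    cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    cli.close()
    try:
        while conn.recv(65536):    # drain whatever arrived before the reset
            pass
        result = "fin"             # graceful close: recv() returned b""
    except OSError:
        result = "rst"             # abortive close observed as a connection reset
    conn.close()
    srv.close()
    return result
```

The receiving side sees the reset as ECONNRESET rather than a clean EOF, which is plausibly the kind of event a proxy would surface as a Reset.Cancel on the other leg of the connection.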