
Linkerd stops forwarding gRPC traffic

See original GitHub issue

Issue Type:

  • Bug report
  • Feature request

What happened:

After some time, linkerd stops forwarding traffic to a service. Restarting the affected linkerd instance restores traffic flow. Bypassing linkerd and making a request directly to the service works.

Here’s a netty frame dump for a failing client request (the l5d-dtab header is there to force linkerd to forward to the failing instance):

2018-09-03 17:01:11,191 DEBUG io.grpc.netty.NettyClientHandler   [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] OUTBOUND SETTINGS: ack=false settings={ENABLE_PUSH=0, MAX_CONCURRENT_STREAMS=0, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=8192}
2018-09-03 17:01:11,223 DEBUG io.grpc.netty.NettyClientHandler   [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] OUTBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=983041
2018-09-03 17:01:11,295 DEBUG io.grpc.netty.NettyClientHandler   [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] INBOUND SETTINGS: ack=false settings={INITIAL_WINDOW_SIZE=1048576, MAX_FRAME_SIZE=4194304}
2018-09-03 17:01:11,298 DEBUG io.grpc.netty.NettyClientHandler   [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] OUTBOUND SETTINGS: ack=true
2018-09-03 17:01:11,300 DEBUG io.grpc.netty.NettyClientHandler   [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] INBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=1966082
2018-09-03 17:01:11,300 DEBUG io.grpc.netty.NettyClientHandler   [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] INBOUND SETTINGS: ack=true
2018-09-03 17:01:11,352 DEBUG io.grpc.netty.NettyClientHandler   [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] OUTBOUND HEADERS: streamId=3 headers=GrpcHttp2OutboundHeaders[:authority: 127.0.0.1:41422, :path: /bigcommerce.rpc.storeconfig.StoreConfig/GetStore, :method: POST, :scheme: http, content-type: application/grpc, te: trailers, user-agent: grpc-java-netty/1.11.0, l5d-dtab: /svc/* => /$/inet/10.171.25.200/4143, grpc-accept-encoding: gzip, grpc-trace-bin: ] streamDependency=0 weight=16 exclusive=false padding=0 endStream=false
2018-09-03 17:01:11,368 DEBUG io.grpc.netty.NettyClientHandler   [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] OUTBOUND DATA: streamId=3 padding=0 endStream=true length=10 bytes=000000000508c8c4e805
2018-09-03 17:01:12,027 DEBUG io.grpc.netty.NettyClientHandler   [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] INBOUND RST_STREAM: streamId=3 errorCode=8
[error] (run-main-2) io.grpc.StatusRuntimeException: CANCELLED: HTTP/2 error code: CANCEL
[error] Received Rst Stream
io.grpc.StatusRuntimeException: CANCELLED: HTTP/2 error code: CANCEL
Received Rst Stream
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:101)
    at com.bigcommerce.storeconfig.StoreConfigGrpc$StoreConfigBlockingStub.getStore(StoreConfigGrpc.scala:132)
    at com.bigcommerce.storeconfig.TestApp$.delayedEndpoint$com$bigcommerce$storeconfig$TestApp$1(TestApp.scala:35)
    at com.bigcommerce.storeconfig.TestApp$delayedInit$body.apply(TestApp.scala:7)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at com.bigcommerce.storeconfig.TestApp$.main(TestApp.scala:7)
    at com.bigcommerce.storeconfig.TestApp.main(TestApp.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
[trace] Stack trace suppressed: run last compile:run for the full output.
2018-09-03 17:01:12,041 DEBUG io.grpc.netty.NettyClientHandler   [id: 0xaeb83799, L:/127.0.0.1:50360 - R:/127.0.0.1:41422] OUTBOUND RST_STREAM: streamId=3 errorCode=8
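For context, the l5d-dtab header in the frame dump above carries a routing override: the rule /svc/* => /$/inet/10.171.25.200/4143 tells linkerd to send every logical service name to one pinned instance. As a toy illustration only (a simple prefix rewrite, not linkerd's actual Dtab implementation), the effect of that rule looks like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch of the l5d-dtab override used above: a dtab entry rewrites a
// logical service path to a concrete destination. This is an illustrative
// prefix rewrite, not linkerd's real Dtab matching semantics.
public class DtabSketch {
    static final Map<String, String> DTAB = new LinkedHashMap<>();
    static {
        // "/svc/* => /$/inet/10.171.25.200/4143" pins every service name
        // under /svc/ to a single concrete ip/port.
        DTAB.put("/svc/", "/$/inet/10.171.25.200/4143");
    }

    static String resolve(String path) {
        for (Map.Entry<String, String> e : DTAB.entrySet()) {
            if (path.startsWith(e.getKey())) return e.getValue();
        }
        return path; // no rule matched; leave the path alone
    }

    public static void main(String[] args) {
        // The storeconfig service resolves to the pinned instance
        System.out.println(resolve("/svc/storeconfig")); // /$/inet/10.171.25.200/4143
    }
}
```

This is why the test request lands on the failing instance regardless of what service discovery would otherwise return.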

Here’s a TRACE request that shows something similar:

[production][dal10sl][root@store-app82-p]:~# nghttp -v -H "l5d-add-context: true" -H "l5d-dtab: /svc/* => /$/inet/10.171.25.200/4143;" -H ":method: TRACE" -H "max-forwards: 2" -H ":path: /bigcommerce.rpc.storeconfig.StoreConfig/GetStore"  http://linkerd:4142
[  0.003] Connected
[  0.003] send SETTINGS frame <length=12, flags=0x00, stream_id=0>
          (niv=2)
          [SETTINGS_MAX_CONCURRENT_STREAMS(0x03):100]
          [SETTINGS_INITIAL_WINDOW_SIZE(0x04):65535]
[  0.003] send PRIORITY frame <length=5, flags=0x00, stream_id=3>
          (dep_stream_id=0, weight=201, exclusive=0)
[  0.003] send PRIORITY frame <length=5, flags=0x00, stream_id=5>
          (dep_stream_id=0, weight=101, exclusive=0)
[  0.003] send PRIORITY frame <length=5, flags=0x00, stream_id=7>
          (dep_stream_id=0, weight=1, exclusive=0)
[  0.003] send PRIORITY frame <length=5, flags=0x00, stream_id=9>
          (dep_stream_id=7, weight=1, exclusive=0)
[  0.003] send PRIORITY frame <length=5, flags=0x00, stream_id=11>
          (dep_stream_id=3, weight=1, exclusive=0)
[  0.003] send HEADERS frame <length=137, flags=0x25, stream_id=13>
          ; END_STREAM | END_HEADERS | PRIORITY
          (padlen=0, dep_stream_id=11, weight=16, exclusive=0)
          ; Open new stream
          :method: TRACE
          :path: /bigcommerce.rpc.storeconfig.StoreConfig/GetStore
          :scheme: http
          :authority: linkerd:4142
          accept: */*
          accept-encoding: gzip, deflate
          user-agent: nghttp2/1.31.0
          l5d-add-context: true
          l5d-dtab: /svc/* => /$/inet/10.171.25.200/4143;
          max-forwards: 2
[  0.005] recv SETTINGS frame <length=12, flags=0x00, stream_id=0>
          (niv=2)
          [SETTINGS_INITIAL_WINDOW_SIZE(0x04):1048576]
          [SETTINGS_MAX_FRAME_SIZE(0x05):4194304]
[  0.005] recv WINDOW_UPDATE frame <length=4, flags=0x00, stream_id=0>
          (window_size_increment=1966082)
[  0.005] send SETTINGS frame <length=0, flags=0x01, stream_id=0>
          ; ACK
          (niv=0)
[  0.040] recv SETTINGS frame <length=0, flags=0x01, stream_id=0>
          ; ACK
          (niv=0)
[  0.367] recv RST_STREAM frame <length=4, flags=0x00, stream_id=13>
          (error_code=CANCEL(0x08))
[  0.367] send GOAWAY frame <length=8, flags=0x00, stream_id=0>
          (last_stream_id=0, error_code=NO_ERROR(0x00), opaque_data(0)=[])
Some requests were not processed. total=1, processed=0

Linkerd appears to immediately respond with an HTTP/2 RST_STREAM frame with error code 8 (CANCEL). Neither client I’ve written to test linkerd in this state is issuing a cancel, so I have to assume it’s originating from inside linkerd somewhere.
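The connection between the frame dump and the client exception is the HTTP/2-to-gRPC error mapping. A hedged sketch (the names and mapping table here are mine, condensed from RFC 7540 and the gRPC-over-HTTP/2 wire spec, not grpc-java source):

```java
// Sketch of why RST_STREAM errorCode=8 surfaces to the gRPC client as
// StatusRuntimeException: CANCELLED. RFC 7540 defines CANCEL as 0x8, and
// the gRPC HTTP/2 mapping reports a stream reset with that code as
// status CANCELLED. Other entries are a rough subset for comparison.
public class H2ResetCodes {
    static String grpcStatus(int h2ErrorCode) {
        switch (h2ErrorCode) {
            case 0x8: return "CANCELLED";   // CANCEL: what the client log shows
            case 0x7: return "UNAVAILABLE"; // REFUSED_STREAM: safe to retry
            default:  return "INTERNAL";    // most other reset codes
        }
    }

    public static void main(String[] args) {
        System.out.println("errorCode=8 -> " + grpcStatus(0x8)); // errorCode=8 -> CANCELLED
    }
}
```

So the client is faithfully reporting a reset it received; the open question is why linkerd emits it.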

The logs are flooded with this message (which probably should be improved):

Sep 03 19:33:02 nomad-client11-p.dal10sl.bigcommerce.net linkerd[2801]: W 0903 19:33:02.407 CDT THREAD34 TraceId:6eb223c0216a9fc6: Exception propagated to the default monitor (upstream address: /172.17.0.14:36590, downstream address: /10.171.25.200:4143, label: %/io.l5d.port/4143/#/io.l5d.consul/.local/storeconfig).
Sep 03 19:33:02 nomad-client11-p.dal10sl.bigcommerce.net linkerd[2801]: Reset.Cancel

The /client_state.json endpoint reports correct and up-to-date service discovery information.

A cursory glance at a thread dump did not seem to indicate any deadlocked threads.

What you expected to happen:

Linkerd to forward traffic.

How to reproduce it (as minimally and precisely as possible):

Unknown as of yet.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
dadjeibaah commented on Sep 4, 2018

@zackangelo Thanks for filing this issue. It looks like that diag trace may give us a clue as to where linkerd might be inappropriately sending a Reset.Cancel. We will dig into this.

0 reactions
zackangelo commented on Sep 19, 2018

An update here: we stopped seeing this issue and haven’t seen any new log messages.

Our environment is pretty dynamic and changes all the time, but one important change we made was removing from the linkerd path a service that was creating HTTP requests and then force-closing the socket.

I’m going to close this for now. If I can reproduce the issue by recreating the bad service’s behavior, I’ll reopen with more detail.
