linkerd 1.3.3 issue with gRPC
Issue Type:
- Bug report
What happened:
I updated linkerd from 1.1.2 to 1.3.3 in DCOS 1.9 and started getting the following errors in gRPC clients:
From the Go client:
2017/12/05 17:30:05 rpc error: code = ResourceExhausted desc = grpc: received message larger than max (845559858 vs. 4194304)
From the .NET Core client:
Unhandled Exception: Grpc.Core.RpcException: Status(StatusCode=Internal, Detail="Failed to deserialize response message.")
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Grpc.Core.Internal.AsyncCall`2.UnaryCall(TRequest msg)
at Grpc.Core.DefaultCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
at Grpc.Core.Internal.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
It looks like the message becomes invalid. The errors don’t happen every time; I have to call the endpoint several times to trigger one. The message itself is not that big, around 40 KB.
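For reference, the 4194304 in the Go error is gRPC-Go's default maximum receive message size (4 MB). Raising it on the client wouldn't be a real fix here, since the payload is only about 40 KB and the huge reported size suggests the length prefix itself is wrong, but for completeness this is a minimal sketch (placeholder address and limit, not from the original report) of how that limit is normally raised:

    package main

    import (
        "log"

        "google.golang.org/grpc"
    )

    func main() {
        // 4194304 (4 MB) is gRPC-Go's default maximum receive message size.
        // The address and the 16 MB limit below are placeholders, not from
        // the original report.
        conn, err := grpc.Dial(
            "linkerd.example:4140",
            grpc.WithInsecure(),
            grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(16*1024*1024)),
        )
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()
        // Create a generated stub from conn and issue calls as usual.
    }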
What you expected to happen:
How to reproduce it (as minimally and precisely as possible): I tried running linkerd locally, but couldn’t reproduce this issue.
Anything else we need to know?:
Environment:
- linkerd/namerd version, config files: linkerd 1.3.3, no namerd
- Platform, version, and config files (Kubernetes, DC/OS, etc): DCOS Enterprise 1.9
- Cloud provider or hardware configuration: AWS
linkerd configuration:
---
usage:
  enabled: false
admin:
  port: 9990
  ip: 0.0.0.0
namers:
- kind: io.l5d.marathon
  host: leader.mesos
  port: 443
  uriPrefix: "/marathon"
  prefix: "/io.l5d.marathon"
  tls:
    commonName: master.mesos
    trustCerts:
    - "/mnt/mesos/sandbox/.ssl/ca.crt"
routers:
- protocol: h2
  experimental: true
  identifier:
    kind: io.l5d.header.token
  dtab: "/ph=>/$/io.buoyant.rinet;/srv=>/$/io.buoyant.porthostPfx/ph;/svc=>/srv;/marathonId=>/#/io.l5d.marathon;/svc=>/$/io.buoyant.http.domainToPathPfx/marathonId"
  servers:
  - port: 4140
    ip: 0.0.0.0
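For context, the io.l5d.header.token identifier in this router names requests by a header value (the Host/:authority header by default), so gRPC clients address the local linkerd listener and set the authority to the target app. A minimal sketch under that assumption, with a hypothetical app name:

    package main

    import (
        "log"

        "google.golang.org/grpc"
    )

    func main() {
        // With the io.l5d.header.token identifier, linkerd takes the service
        // name from the Host/:authority header by default, so the client
        // dials the local linkerd listener (port 4140 in the config above)
        // and overrides the authority with the target app name.
        // "my-marathon-app" is a placeholder, not from the original report.
        conn, err := grpc.Dial(
            "localhost:4140",
            grpc.WithInsecure(),
            grpc.WithAuthority("my-marathon-app"),
        )
        if err != nil {
            log.Fatalf("dial linkerd: %v", err)
        }
        defer conn.Close()
        // Create a generated stub from conn and issue unary/streaming calls as usual.
    }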
Top GitHub Comments
@wmorgan I was able to troubleshoot it a bit further. I added some logging to the gRPC client to get HTTP/2 framing info. I call the same endpoint with the same parameters several times (unary call). Stream 59 is a successful one; stream 61 is where the error happens. When gRPC reads an HTTP/2 DATA frame it expects the payload to be length-prefixed (at least that is what gRPC expects), so it reads the first few bytes to get the message length, but the message on stream 61 doesn’t have the length header, so gRPC reads a few bytes of the message body as the length instead and throws the error. Are there any logs I can get you from linkerd? I still cannot reproduce the issue locally.
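For reference, the length prefix described above is gRPC's standard message framing inside HTTP/2 DATA frames: a one-byte compressed flag, a four-byte big-endian message length, and then the message bytes. A minimal sketch (not from the original report) of parsing that prefix, showing how a frame without the header makes the receiver treat payload bytes as an enormous length:

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // parseGRPCFrame parses gRPC's length-prefixed message framing:
    // 1 byte compressed flag, 4-byte big-endian message length, then the
    // message bytes. If the prefix is missing, the first payload bytes get
    // interpreted as the length, which is how a ~40 KB message can be
    // reported as hundreds of megabytes.
    func parseGRPCFrame(data []byte) (compressed bool, msg []byte, err error) {
        if len(data) < 5 {
            return false, nil, fmt.Errorf("short frame: %d bytes", len(data))
        }
        compressed = data[0] == 1
        length := binary.BigEndian.Uint32(data[1:5])
        if int(length) > len(data)-5 {
            return false, nil, fmt.Errorf("declared length %d exceeds available %d bytes", length, len(data)-5)
        }
        return compressed, data[5 : 5+length], nil
    }

    func main() {
        // Example: a 3-byte uncompressed message with a correct prefix.
        frame := []byte{0x00, 0x00, 0x00, 0x00, 0x03, 'a', 'b', 'c'}
        _, msg, err := parseGRPCFrame(frame)
        fmt.Println(string(msg), err)
    }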
I tried to reproduce it in my local cluster, but unfortunately had no luck. However, I can reliably reproduce it in two AWS DCOS clusters, one on 1.9 and another on 1.10.
Since the issue started happening between 1.3.2 and 1.3.3, I tried to narrow it down to a specific commit. Here is what I’ve got: I built two linkerd Docker images, one from commit https://github.com/linkerd/linkerd/commit/c6f0d2eaeecca80c60314e6f6cb852a31870877a and another from commit https://github.com/linkerd/linkerd/commit/0bd8a91ed51fecc34b86f110382e2077a7b88600. The issue happens with the first one but not with the second, so it looks like https://github.com/linkerd/linkerd/commit/c6f0d2eaeecca80c60314e6f6cb852a31870877a contributes to the issue somehow.