Question: node grpc versus grpc-js performance
My team just updated our internal tools to use grpc-js in lieu of grpc. A primary motivation is that we want to be able to use later versions of Node and stop depending on a deprecated library.
After a couple of weeks, we've identified that the node 12 + grpc to node 14 + grpc-js upgrade has resulted in significant performance degradation for one of our endpoints. This endpoint makes a protobuf network call for a repeated type that returns 100+ objects. Locally we're seeing a 3x latency increase between the two versions of the endpoint, but we haven't debugged the issue beyond swapping the libraries so far, so there's a chance the issue is specific to our app.
Is this a known tradeoff between the two libraries, potentially for example that serialization takes longer with the raw JS version? Are there recommended optimizations or configurations that I might not be following that would be contributing to our problem? I would really prefer to keep grpc-js and try to optimize code rather than doing a rollback if possible.
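For anyone investigating similar slowdowns: grpc-js accepts channel options at client construction that can matter for large responses. The option names below are real gRPC/grpc-js channel options, but the values and the assumption that they apply to this endpoint are mine, not from the thread.

```javascript
// Sketch: channel options that can affect large-response performance in
// @grpc/grpc-js. The values are illustrative assumptions, not tuned numbers.
const channelOptions = {
  // grpc-js caps buffered memory per HTTP/2 session (default 10 MB);
  // raising it can help when one response carries 100+ sizable objects.
  'grpc-node.max_session_memory': 64,
  // Keep the HTTP/2 connection warm between calls.
  'grpc.keepalive_time_ms': 120000,
};

// With @grpc/grpc-js, the options go in the third constructor argument, e.g.:
//   const client = new proto.my.pkg.MyService(
//       'my-server:443', grpc.credentials.createSsl(), channelOptions);

module.exports = { channelOptions };
```

Whether these help depends on where the time is actually going, so profile before and after changing them.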
Issue Analytics
- Created 2 years ago
- Reactions: 3
- Comments: 21 (9 by maintainers)
Top GitHub Comments
I spent the last day debugging this, and here are my findings so far. For background, I have a Go-based gRPC server running in the cloud, and previously I was using a Node-based grpc client, also running in the cloud, to make requests to this server.
Upgrading from the legacy grpc to grpc-js resulted in a 4-5x performance degradation on my gRPC invocations. Initially I suspected data serialization to be the issue due to our large response bodies, but running various profilers suggests the http2 connections are sitting in async time much longer. I confirmed that data serialization wasn't the problem by running both the Node client and the Go server locally (without TLS; not sure if that makes a difference here), and each call dropped to around 100ms, which is dramatically lower than the typical call time.
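For anyone reproducing this kind of comparison, a small probe around a unary call makes the client-side latency easy to log. This is a generic sketch; the wrapped method name in the usage comment is a placeholder, not from the thread.

```javascript
// Generic latency probe for a callback-style unary gRPC method.
// `fn` is any function with the (request, callback) shape that both the
// legacy grpc and grpc-js client stubs expose.
function timeUnaryCall(fn, request) {
  return new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    fn(request, (err, response) => {
      // Nanoseconds to milliseconds.
      const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
      if (err) return reject(err);
      resolve({ elapsedMs, response });
    });
  });
}

// Against a real client this would look like:
//   const { elapsedMs } = await timeUnaryCall(
//       client.listServices.bind(client), {});
```

Running the same probe against both library versions on the same endpoint gives a like-for-like number to compare.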
Are there any connection settings that I should be looking into to improve the connection responsiveness? It seems odd to have connections hanging so much longer than with the legacy package.
Just in case this info might be helpful, I ran my client with GRPC_TRACE=call_stream and GRPC_VERBOSITY=DEBUG in both a remote-server and a local-server scenario, and the data frames received from the remote server are much smaller. Not sure if this is a symptom of the behavior I mentioned or could help root-cause some potential improvements. Thanks!
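For reference, the tracing setup mentioned above amounts to setting two environment variables when launching the Node client (client.js here is a placeholder for your entry point):

```shell
# Enable grpc-js call-stream tracing at debug verbosity.
GRPC_TRACE=call_stream GRPC_VERBOSITY=DEBUG node client.js
```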
Update: current solution: If you reference my protobuf definitions above, we were adding k8s object annotations (
map<string, string> annotations
) into therepeated WeK8sGenericService services
. We included these for developer velocity purposes, so that lower-level services wouldn’t need to have extra protobuf field definitions updated every time our higher-level graphql server wanted something from the annotations.When I removed this field and replaced it with one-to-one field definitions for what we actually were pulling, the call time dropped dramatically on local testing to around 100-200ms (Note that all other protobuf calls we benchmarked for grpc-js against grpc legacy were comparable, so we were only concerned about this endpoint).
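The shape of that change, with hypothetical field names standing in for what we actually pull from the annotations, looks roughly like:

```proto
// Before: the open-ended annotations map was forwarded wholesale, e.g.
//   map<string, string> annotations = 2;
// After: one-to-one fields for only the values the graphql layer consumes.
// Field names below are hypothetical illustrations.
message WeK8sGenericService {
  string name = 1;
  string owner_team = 2;      // previously read from annotations
  string deploy_channel = 3;  // previously read from annotations
}
```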
Notes: Why did this help? Normally I would assume that the amount of information returned in the annotations block was simply dwarfing everything else. However, when I wrote the results out to a file, the file size didn't seem proportional to the performance gained by excluding the field. I'd have to run more tests to be sure, though.
I also tested just using a raw string instead of a map type, and that didn’t improve performance at all.
Thus, I’m not sure exactly why this was causing performance problems for my use-case. I would also say that legacy grpc was impacted as well, since it was slow without these changes, just not as slow as our grpc-js setup.
For now, we're happy just reducing our message size with these changes and leaving it there, even though it's a bit of an inconvenience to have to update field definitions in multiple places every time we want to expose something from annotations to our graphql server.
If you want me to investigate further, I could pull some hard metrics comparing with and without the annotations block for grpc legacy and grpc-js, but for now, I’ll regard this issue as resolved. Thanks for your help!