Experiencing Bottlenecking at Scale - is it related to having a single gRPC connection?
Description Hello. My team communicates with Triton over gRPC from our server, which is written in Go. We followed the simple example here, but we are now experiencing what we believe is a bottleneck caused by establishing only a single client on our Go server via this call:
client := triton.NewGRPCInferenceServiceClient(conn)
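For context, our setup looks roughly like the sketch below; the import path and server address are placeholders rather than our real values.

```go
package main

import (
	"log"

	"google.golang.org/grpc"

	triton "example.com/gen/tritonpb" // placeholder for the generated Triton gRPC bindings
)

func main() {
	// A single connection and a single client, shared by every goroutine in the server.
	conn, err := grpc.Dial("triton:8001", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("failed to connect to Triton: %v", err)
	}
	defer conn.Close()

	client := triton.NewGRPCInferenceServiceClient(conn)
	_ = client // all inference requests in the server go through this one client
}
```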
We did some digging in the perf_analyzer code and noticed that, in order to implement concurrency, it establishes a new Triton client per thread.
This has us believing that the correct way to implement high-volume communication with Triton is to maintain a connection pool between our Go server and Triton. However, none of the examples or documentation points to this, so we figured it would be best to ask the experts here.
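To make the question concrete, here is a rough sketch of the kind of pool we have in mind. This is unvalidated; the import path, pool size, and round-robin selection are our own assumptions, not anything taken from the Triton docs.

```go
package tritonpool

import (
	"sync/atomic"

	"google.golang.org/grpc"

	triton "example.com/gen/tritonpb" // placeholder for the generated Triton gRPC bindings
)

// Pool holds several independent gRPC connections to Triton and hands out
// clients round-robin, so concurrent requests are spread across connections
// instead of all being multiplexed onto a single HTTP/2 connection.
type Pool struct {
	conns   []*grpc.ClientConn
	clients []triton.GRPCInferenceServiceClient
	next    uint64
}

// New dials `size` separate connections to the given Triton gRPC endpoint.
func New(addr string, size int) (*Pool, error) {
	p := &Pool{}
	for i := 0; i < size; i++ {
		conn, err := grpc.Dial(addr, grpc.WithInsecure())
		if err != nil {
			return nil, err
		}
		p.conns = append(p.conns, conn)
		p.clients = append(p.clients, triton.NewGRPCInferenceServiceClient(conn))
	}
	return p, nil
}

// Client returns the next client in round-robin order; safe for concurrent use.
func (p *Pool) Client() triton.GRPCInferenceServiceClient {
	n := atomic.AddUint64(&p.next, 1)
	return p.clients[n%uint64(len(p.clients))]
}
```

The round-robin choice here is arbitrary; is something along these lines the recommended pattern, or is there a better approach?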
Triton Information What version of Triton are you using? 21.09-py3
Are you using the Triton container or did you build it yourself? Both the Triton container and a self-built version exhibit the same issue.
To Reproduce Steps to reproduce the behavior.
We have verified this via the following:
- Create an application that sends 1000 inference requests to Triton as quickly as possible, and note the requests per second
- Create three instances of the same application running on separate threads, and note roughly a 3x increase in aggregate throughput
We feel confident that the bottleneck is not within the application itself, and we believe we have isolated it to the client connection.
Our model is a simple resnet18 model.
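The single-application measurement above was essentially the following. This is a simplified sketch: the request construction is elided, and the goroutine fan-out, request count, and error handling are illustrative rather than our exact code.

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	triton "example.com/gen/tritonpb" // placeholder for the generated Triton gRPC bindings
)

// measure fires n inference requests at the shared client as fast as it will
// accept them and returns the observed requests per second.
func measure(client triton.GRPCInferenceServiceClient, req *triton.ModelInferRequest, n int) float64 {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := client.ModelInfer(context.Background(), req); err != nil {
				log.Printf("infer failed: %v", err)
			}
		}()
	}
	wg.Wait()
	return float64(n) / time.Since(start).Seconds()
}
```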
Expected behavior Documentation describing the best practice for high-volume communication with Triton.
Top GitHub Comments
We are seeing the same result between our server and Triton. Thanks for the help!
Yes, that’s the change I was talking about. We see better performance at scale with this change under heavy load.
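The thread does not show the change itself, but if it is the multiple-connection approach discussed above, the caller side might look like this usage sketch. It builds on the hypothetical Pool type sketched earlier, and the pool size of 4 is arbitrary.

```go
// Dial several connections once at startup, then pick a client per request so
// load is spread across connections rather than funneled through one shared client.
pool, err := tritonpool.New("triton:8001", 4)
if err != nil {
	log.Fatalf("failed to build Triton connection pool: %v", err)
}

resp, err := pool.Client().ModelInfer(ctx, req)
if err != nil {
	log.Printf("infer failed: %v", err)
}
_ = resp // handle the inference response as before
```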