
Experiencing Bottlenecking at Scale - is it related to having a single gRPC connection?

See original GitHub issue

Description

Hello. My team communicates with Triton over gRPC from our server, which is written in Go. We followed the simple example here, but we are now seeing what we believe is a bottleneck caused by establishing only a single client on our Go server via this call:

client := triton.NewGRPCInferenceServiceClient(conn)
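For a fuller picture, our setup looks roughly like the minimal sketch below. The triton import path is a placeholder for the stubs we generate from Triton's grpc_service.proto, and the address and liveness check are just for illustration:

package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    triton "example.com/ourapp/internal/tritonpb" // placeholder path for stubs generated from grpc_service.proto
)

func main() {
    // A single connection and a single client, shared by every goroutine in the server.
    conn, err := grpc.Dial("localhost:8001", grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("dial triton: %v", err)
    }
    defer conn.Close()

    client := triton.NewGRPCInferenceServiceClient(conn)

    // Sanity-check the server before sending traffic.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    if _, err := client.ServerLive(ctx, &triton.ServerLiveRequest{}); err != nil {
        log.Fatalf("server live check: %v", err)
    }
}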

We did some digging in the perf_analyzer code and we noticed that in order to implement concurrency, a new Triton client is established per thread.

This leads us to believe that the correct way to implement high-volume communication with Triton is to maintain a connection pool between our Go server and Triton. However, none of the examples or documentation points to this, so we figured it would be best to ask the experts here.
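Something along these lines is what we have in mind (a rough sketch only, reusing the same placeholder triton stub package and insecure credentials as above; pool size, address, and error handling are simplified):

package tritonpool

import (
    "sync/atomic"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    triton "example.com/ourapp/internal/tritonpb" // placeholder path for stubs generated from grpc_service.proto
)

// tritonPool spreads requests over several gRPC connections, each with its own
// stub, instead of funneling every request through a single connection.
type tritonPool struct {
    conns   []*grpc.ClientConn
    clients []triton.GRPCInferenceServiceClient
    next    uint64
}

func newTritonPool(addr string, size int) (*tritonPool, error) {
    p := &tritonPool{}
    for i := 0; i < size; i++ {
        conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
        if err != nil {
            p.Close()
            return nil, err
        }
        p.conns = append(p.conns, conn)
        p.clients = append(p.clients, triton.NewGRPCInferenceServiceClient(conn))
    }
    return p, nil
}

// client returns the next stub in round-robin order; safe for concurrent use.
func (p *tritonPool) client() triton.GRPCInferenceServiceClient {
    n := atomic.AddUint64(&p.next, 1)
    return p.clients[n%uint64(len(p.clients))]
}

// Close tears down all underlying connections.
func (p *tritonPool) Close() {
    for _, conn := range p.conns {
        conn.Close()
    }
}

Callers would then grab a stub per request (for example, p.client().ModelInfer(ctx, req)), so concurrent requests fan out over several HTTP/2 connections instead of all multiplexing onto one.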

Triton Information

What version of Triton are you using? 21.09-py3

Are you using the Triton container or did you build it yourself? Both the Triton container and a self-built version exhibit the same issue.

To Reproduce

Steps to reproduce the behavior:

We have verified this via the following:

  1. Create an application that sends 1000 inference requests to Triton as quickly as possible, and note the requests per second.
  2. Run three instances of the same application on separate threads, and note roughly 3x the throughput.

We are confident that the bottleneck is not within the application itself; we believe we have isolated it to the client connection.
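For reference, the load-generation side of step 1 looks roughly like the sketch below (a hypothetical helper; building the ModelInferRequest for resnet18, i.e. model name, input tensor shapes, and raw input bytes, is elided):

package loadtest

import (
    "context"
    "log"
    "sync"
    "time"

    triton "example.com/ourapp/internal/tritonpb" // placeholder path for stubs generated from grpc_service.proto
)

// benchmark fires `total` inference requests from `workers` goroutines through a
// single shared client and reports throughput.
func benchmark(client triton.GRPCInferenceServiceClient, req *triton.ModelInferRequest, total, workers int) {
    jobs := make(chan struct{}, total)
    for i := 0; i < total; i++ {
        jobs <- struct{}{}
    }
    close(jobs)

    var wg sync.WaitGroup
    start := time.Now()
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for range jobs {
                if _, err := client.ModelInfer(context.Background(), req); err != nil {
                    log.Printf("infer: %v", err)
                }
            }
        }()
    }
    wg.Wait()

    elapsed := time.Since(start)
    log.Printf("%d requests in %s (%.1f req/s)", total, elapsed, float64(total)/elapsed.Seconds())
}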

Our model is a simple resnet18 model.

Expected behavior

Documentation describing the best practice for high-volume communication with Triton.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
bryanmcgrane commented on Feb 9, 2022

We are seeing the same result between our server and Triton. Thanks for the help!

0 reactions
tanmayv25 commented on Feb 9, 2022

Yes. That’s the change I was talking about. We see better at-scale performance with this change for heavy load.

Read more comments on GitHub >

Top Results From Across the Web

  • The Mysterious Gotcha of gRPC Stream Performance (Ably Blog): gRPC is highly useful for fast, efficient data exchange and client/server state sync. Here's a performance gotcha we ran across.
  • Performance Best Practices (gRPC): A user guide of both general and language-specific best practices to improve performance.
  • An Introduction to gRPC (Mattermost): The HTTP/1.1 responses must come back in the order received, which can cause a processing bottleneck. You can use multiple TCP connections to …
  • Load balancing and scaling long-lived connections in …: TL;DR: Kubernetes doesn't load balance long-lived connections, and some Pods might receive more requests than others. If you're using HTTP/2, gRPC, …
  • RPC vs. Messaging – which is faster? (Particular Software): Ignoring all the other advantages messaging has, they'll ask us … or technology like REST, microservices, gRPC, WCF, Java RMI, etc.
