Experiencing Bottlenecking at Scale - is it related to having a single gRPC connection?
Description Hello. My team communicates with Triton over gRPC from our server, which is written in Go. We followed the simple example here, but we are now experiencing what we believe is a bottleneck caused by establishing only a single client on our Go server via this call:
client := triton.NewGRPCInferenceServiceClient(conn)
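For context, our setup looks roughly like the sketch below; the import path and server address are placeholders rather than our real values.

```go
package main

import (
	"log"

	"google.golang.org/grpc"

	triton "example.com/gen/tritonpb" // placeholder for the generated Triton gRPC bindings
)

func main() {
	// A single connection and a single client, shared by every goroutine in the server.
	conn, err := grpc.Dial("triton:8001", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("failed to connect to Triton: %v", err)
	}
	defer conn.Close()

	client := triton.NewGRPCInferenceServiceClient(conn)
	_ = client // all inference requests in the server go through this one client
}
```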
We did some digging in the perf_analyzer code and noticed that, in order to implement concurrency, it establishes a new Triton client per thread.
This has us believing that the correct way to implement high-volume communication with Triton is to maintain a connection pool between our Go server and Triton. However, none of the examples or documentation points to this, so we figured it would be best to ask the experts here.
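To make the question concrete, here is a rough sketch of the kind of pool we have in mind. This is unvalidated; the import path, pool size, and round-robin selection are our own assumptions, not anything taken from the Triton docs.

```go
package tritonpool

import (
	"sync/atomic"

	"google.golang.org/grpc"

	triton "example.com/gen/tritonpb" // placeholder for the generated Triton gRPC bindings
)

// Pool holds several independent gRPC connections to Triton and hands out
// clients round-robin, so concurrent requests are spread across connections
// instead of all being multiplexed onto a single HTTP/2 connection.
type Pool struct {
	conns   []*grpc.ClientConn
	clients []triton.GRPCInferenceServiceClient
	next    uint64
}

// New dials `size` separate connections to the given Triton gRPC endpoint.
func New(addr string, size int) (*Pool, error) {
	p := &Pool{}
	for i := 0; i < size; i++ {
		conn, err := grpc.Dial(addr, grpc.WithInsecure())
		if err != nil {
			return nil, err
		}
		p.conns = append(p.conns, conn)
		p.clients = append(p.clients, triton.NewGRPCInferenceServiceClient(conn))
	}
	return p, nil
}

// Client returns the next client in round-robin order; safe for concurrent use.
func (p *Pool) Client() triton.GRPCInferenceServiceClient {
	n := atomic.AddUint64(&p.next, 1)
	return p.clients[n%uint64(len(p.clients))]
}
```

The round-robin choice here is arbitrary; is something along these lines the recommended pattern, or is there a better approach?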
Triton Information What version of Triton are you using? 21.09-py3
Are you using the Triton container or did you build it yourself? Both the Triton container and a self-built version exhibit the same issue.
To Reproduce Steps to reproduce the behavior.
We have verified this via the following:
- Create an application that sends 1000 inference requests to Triton as quickly as possible, and note the requests per second
- Create three instances of the same application running on separate threads, and note roughly a 3x increase in aggregate throughput
We feel confident that the bottleneck is not within the application itself, and we believe we have isolated it to the client connection.
Our model is a simple resnet18 model.
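The single-application measurement above was essentially the following. This is a simplified sketch: the request construction is elided, and the goroutine fan-out, request count, and error handling are illustrative rather than our exact code.

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	triton "example.com/gen/tritonpb" // placeholder for the generated Triton gRPC bindings
)

// measure fires n inference requests at the shared client as fast as it will
// accept them and returns the observed requests per second.
func measure(client triton.GRPCInferenceServiceClient, req *triton.ModelInferRequest, n int) float64 {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := client.ModelInfer(context.Background(), req); err != nil {
				log.Printf("infer failed: %v", err)
			}
		}()
	}
	wg.Wait()
	return float64(n) / time.Since(start).Seconds()
}
```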
Expected behavior Documentation describing the best practice for high-volume communication with Triton.
Top GitHub Comments
We are seeing the same result between our server and Triton. Thanks for the help!
Yes, that’s the change I was talking about. We see better performance at scale with this change under heavy load.
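The thread does not show the change itself, but if it is the multiple-connection approach discussed above, the caller side might look like this usage sketch. It builds on the hypothetical Pool type sketched earlier, and the pool size of 4 is arbitrary.

```go
// Dial several connections once at startup, then pick a client per request so
// load is spread across connections rather than funneled through one shared client.
pool, err := tritonpool.New("triton:8001", 4)
if err != nil {
	log.Fatalf("failed to build Triton connection pool: %v", err)
}

resp, err := pool.Client().ModelInfer(ctx, req)
if err != nil {
	log.Printf("infer failed: %v", err)
}
_ = resp // handle the inference response as before
```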