Poor performance on Triton vs. inferencing without Triton
Description
Using an EfficientNet-B7 model converted to a half-precision TensorRT engine. Running the TensorRT engine directly, 1 image takes ~20 ms and 8 images take ~160 ms.
When the same model is loaded via Triton, 1 image takes ~50 ms and 8 images take ~330 ms. I am sending requests to and from the same physical machine. I understand there is some overhead from HTTP, but this seems excessive?
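For reference, a request on the Triton side might look roughly like the following (a minimal sketch using the tritonclient HTTP API; the model name, tensor names, shape, and dtype are assumptions, since they are not given in the issue):

import numpy as np
import tritonclient.http as httpclient

# Assumed names and shapes, for illustration only
MODEL_NAME = "efficientnet_b7"
INPUT_NAME = "input"
OUTPUT_NAME = "output"

client = httpclient.InferenceServerClient(url="localhost:8000")

# One batch of 8 FP16 images in NCHW layout (600x600 is assumed for B7)
batch = np.random.rand(8, 3, 600, 600).astype(np.float16)

infer_input = httpclient.InferInput(INPUT_NAME, list(batch.shape), "FP16")
infer_input.set_data_from_numpy(batch, binary_data=True)
infer_output = httpclient.InferRequestedOutput(OUTPUT_NAME, binary_data=True)

result = client.infer(MODEL_NAME, inputs=[infer_input], outputs=[infer_output])
predictions = result.as_numpy(OUTPUT_NAME)

Using binary_data=True sends the tensors as raw bytes appended to the JSON request rather than as JSON-encoded arrays, which matters for inputs this large.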
My inference code for the standalone TensorRT path (using Polygraphy's TrtRunner):
from polygraphy.backend.trt import TrtRunner

def inference_rt(engine, data, input_name):
    # Wrap the deserialized TensorRT engine in a Polygraphy runner
    runner = TrtRunner(engine)
    with runner:
        # Bind the input tensor by name and run a single inference pass
        outputs = runner.infer(feed_dict={input_name: data})
    return outputs
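For completeness, a minimal sketch of how the engine might be deserialized and this function called, assuming a serialized engine file on disk (the file name, input name, and batch shape are hypothetical):

import numpy as np
from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.trt import EngineFromBytes

# Deserialize a previously built half-precision engine (the path is an assumption)
engine = EngineFromBytes(BytesFromPath("efficientnet_b7_fp16.plan"))()

batch = np.random.rand(8, 3, 600, 600).astype(np.float16)  # shape/dtype assumed
outputs = inference_rt(engine, batch, "input")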
Any performance tips or things I'm missing? Is this typical? Thanks.
Triton Information
22.03
Are you using the Triton container or did you build it yourself? Using the 22.03 container.
Issue Analytics
- Created a year ago
- Comments: 8 (6 by maintainers)
Top GitHub Comments
Using the perf analyzer, the time spent on the inference itself looks close to the standalone numbers, and I am on a slower GPU; it is around 20 ms per inference, which seems correct. I'm still disappointed by how much overhead and latency there is in moving data between models and in making and receiving requests.
Any tips on reducing that? The time spent not inferencing is several times greater than the time spent inferencing.
This is intended for remote inferencing, so I don't think shared memory is an option.
Thank you again for all your support.
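For context, the perf_analyzer run mentioned above might look something like this (a sketch only; the model name and batch size are assumptions):

perf_analyzer -m efficientnet_b7 -b 8 --concurrency-range 1:4 --percentile=95

perf_analyzer reports the server-side compute time separately from the client-side round-trip latency, which is how the request/transfer overhead can be separated from the ~20 ms of actual inference.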
Closing this issue due to lack of activity. Please re-open it if you would like to follow up.