Speed difference ONNX vs TensorRT with samples sorted by sequence length
I noticed something unexpected when comparing two scenarios for a model converted via ONNX and TensorRT (distilroberta with a classification head):
- Scenario 1: I use a dataset with varying sentence lengths (~20-60 tokens) and run it through both models in random order.
- Scenario 2: I use the same dataset, but sort the sentences by length (decreasing) before running it through both models.
Result: The TensorRT model does not seem to care about the sequence lengths and keeps the same speed in both scenarios. The ONNX model, however, gets almost twice as fast in the second scenario.
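A likely explanation for the ONNX side, as a rough sketch: if each batch is only padded to its longest member, sorting by length keeps long and short sentences together and drastically reduces padding. The tokenizer checkpoint, batch size, and synthetic sentences below are illustrative assumptions, not taken from the original setup.

```python
# Sketch: count tokens (including padding) for random vs. length-sorted order
# when each batch is padded to its longest member ("longest" padding).
# Fewer padded tokens means less wasted compute for an ONNX model with dynamic axes.
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")  # assumed base tokenizer
sentences = ["word " * random.randint(20, 60) for _ in range(256)]
batch_size = 32

def padded_tokens(samples):
    total = 0
    for i in range(0, len(samples), batch_size):
        enc = tokenizer(samples[i:i + batch_size], padding="longest", return_tensors="np")
        total += enc["input_ids"].size  # counts real tokens plus padding up to the batch max
    return total

random_order = padded_tokens(sentences)
sorted_order = padded_tokens(sorted(sentences, key=len, reverse=True))
print(f"tokens incl. padding - random: {random_order}, sorted: {sorted_order}")
```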
I was wondering whether TensorRT's optimization somehow requires padding to the maximum length internally. I searched for a parameter or an explanation for this behavior but couldn't find anything useful. For conversion, I set the seq-len parameter to `1 60 60`.
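For reference, three seq-len values like these typically describe the min/optimal/max sequence lengths of a TensorRT optimization profile. The sketch below shows how such a dynamic-shape engine is generally built with the TensorRT Python API; the ONNX file path and input names are assumptions, and this is not the project's actual conversion code.

```python
# Sketch: build a TensorRT engine with a dynamic sequence-length profile that
# mirrors "seq-len 1 60 60" (min=1, opt=60, max=60 tokens). Such a profile accepts
# any length between min and max, so constant latency more often points at the
# client or tokenizer padding every request to the maximum length.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # hypothetical path to the exported model
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask"):  # assumed ONNX input names
    profile.set_shape(name, min=(1, 1), opt=(1, 60), max=(1, 60))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
```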
I was wondering if someone else has already observed this and knows the reason or a solution.
Issue Analytics
- State:
- Created: 2 years ago
- Comments: 6 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hey @pommedeterresautee, sorry for the long wait - I was on a holiday trip.
I based my script on your demo scripts, but I cannot disclose the model and/or dataset. You can basically use any dataset with two text inputs, e.g. a QA dataset (question + context). I hope you can make use of it anyway.
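As a public stand-in for such two-input data, something like SQuAD would do; the dataset choice below is only an example, not what was actually used.

```python
# Sketch: load question/context pairs as the two text inputs mentioned above.
# SQuAD via the datasets library is one convenient public choice.
from datasets import load_dataset

squad = load_dataset("squad", split="validation[:256]")
pairs = [(ex["question"], ex["context"]) for ex in squad]
print(pairs[0])
```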
I attached the script that calls the inference ensemble hosted in Triton (transformer_onnx_inference or transformer_trt_inference), as well as the slightly modified model.py for the tokenize endpoint.
If you experience the same thing I do, then calling the ONNX model's inference endpoint should be slower if you comment out the length sorting in triton_inference_qa_test.py, and there should be no difference if you do the same for the TRT model's inference.
Attachments: triton_inference_qa_test.py, model.py
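For anyone trying to reproduce this without the attachments, a minimal sketch of the kind of client loop involved is below. The endpoint names come from the comment above; the input/output names, shapes, batch size, and the use of raw-text inputs are assumptions about the Triton ensemble, not its actual configuration.

```python
# Sketch: toggle length sorting before sending requests to a Triton endpoint.
# With SORT_BY_LENGTH = False this approximates the "random order" scenario.
import numpy as np
import tritonclient.http as httpclient

SORT_BY_LENGTH = True
MODEL = "transformer_onnx_inference"  # or "transformer_trt_inference"
BATCH_SIZE = 32

client = httpclient.InferenceServerClient(url="127.0.0.1:8000")
texts = [  # replace with your own question/context strings
    "short question? short context.",
    "a much longer question with many more tokens? " * 3 + "and a longer context as well.",
]

if SORT_BY_LENGTH:
    texts = sorted(texts, key=len, reverse=True)

for start in range(0, len(texts), BATCH_SIZE):
    batch = texts[start:start + BATCH_SIZE]
    inp = httpclient.InferInput("TEXT", [len(batch)], "BYTES")  # assumed input name
    inp.set_data_from_numpy(np.array(batch, dtype=object))
    out = httpclient.InferRequestedOutput("output")             # assumed output name
    result = client.infer(MODEL, inputs=[inp], outputs=[out])
    _ = result.as_numpy("output")
```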
Can you provide me with some reproducible code so I can test on my side?