[BUG] Using TensorRT slows down inference
System Info
A100-80G
torch 1.11.0+cu113
optimum: built from this PR: https://github.com/huggingface/optimum/pull/586
onnx 1.13.0
onnx-graphsurgeon 0.3.25
onnxconverter-common 1.13.0
onnxoptimizer 0.3.2
onnxruntime 1.13.1
onnxruntime-gpu 1.12.1
onnxruntime-tools 1.7.0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
First, I store the ONNX files locally; the path is /opt/tiger/genius/checkpoints/codegen-6B-mono/onnx/.
My code is:
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "Salesforce/codegen-6B-mono",
    from_transformers=True,
    provider="TensorrtExecutionProvider",
)
model.save_pretrained("./codegen-6B-mono/onnx")
The directory structure is:
codegen-6B-mono/onnx
  - config.json
  - decoder_model.onnx
  - decoder_model.onnx_data
  - decoder_with_past_model.onnx
  - decoder_with_past_model.onnx_data
I follow your PR (https://github.com/huggingface/optimum/pull/421) and write the code below:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/opt/tiger/genius/preprocess/codegen-6B-mono", use_fast=True)
model = ORTModelForCausalLM.from_pretrained(
    "./codegen-6B-mono/onnx/",
    provider="TensorrtExecutionProvider",
    use_cache=True,
    use_io_binding=True,
    decoder_file_name="decoder_model.onnx",
    decoder_with_past_file_name="decoder_with_past_model.onnx",
)
model.save_pretrained("/opt/tiger/genius/checkpoints/codegen-6B-mono-onnx-v2")
Then I run inference:
import os
from datetime import datetime
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
generator = pipeline("text-generation", model=model, device=local_rank, tokenizer=tokenizer)
p0 = datetime.utcnow()
sample = generator("func FormatInt(i int64, base int) string{ ", pad_token_id=tokenizer.eos_token_id,
                   max_new_tokens=128, max_time=20.0, do_sample=True, temperature=0.8, top_p=0.95,
                   use_cache=True, num_return_sequences=1)
p1 = datetime.utcnow()
print(f"Time difference is {(p1 - p0).total_seconds()} seconds")
print(sample)
Expected behavior
Generation takes 21 s, which is slower than the 4 s I get with the plain Hugging Face generate. Also, loading the model takes around 1 hour. This is not acceptable.
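A hedged sketch, not part of the original report: much of that load time with the TensorRT execution provider typically goes into building the TensorRT engine at session creation. ONNX Runtime's TensorRT provider options trt_engine_cache_enable and trt_engine_cache_path allow the built engine to be cached on disk and reused; the cache directory below is an arbitrary example, and this assumes the installed Optimum version exposes provider_options in from_pretrained.

from optimum.onnxruntime import ORTModelForCausalLM

# Sketch: cache the TensorRT engine on disk so it is not rebuilt on every load.
# "./trt_cache" is an assumed path; adjust as needed.
provider_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "./trt_cache",
}
model = ORTModelForCausalLM.from_pretrained(
    "./codegen-6B-mono/onnx/",
    provider="TensorrtExecutionProvider",
    provider_options=provider_options,
    use_cache=True,
    use_io_binding=True,
)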
Top GitHub Comments
My log:

2022-12-21 18:33:23.301532915 [I:onnxruntime:, inference_session.cc:263 operator()] Flush-to-zero and denormal-as-zero are off
2022-12-21 18:33:23.301563968 [I:onnxruntime:, inference_session.cc:271 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2022-12-21 18:33:23.301572917 [I:onnxruntime:, inference_session.cc:292 ConstructorCommon] Dynamic block base set to 0
2022-12-21 18:33:24.851080872 [I:onnxruntime:, inference_session.cc:1222 Initialize] Initializing session.
2022-12-21 18:33:24.851107028 [I:onnxruntime:, inference_session.cc:1259 Initialize] Adding default CPU execution provider.
2022-12-21 18:33:24.851140837 [I:onnxruntime:, session_state.cc:31 SetupAllocators] Allocator already registered for OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]. Ignoring allocator from CUDAExecutionProvider
2022-12-21 18:33:24.851150630 [I:onnxruntime:, session_state.cc:31 SetupAllocators] Allocator already registered for OrtMemoryInfo:[name:CudaPinned id:0 OrtMemType:-1 OrtAllocatorType:1 Device:[DeviceType:0 MemoryType:1 DeviceId:0]]. Ignoring allocator from CUDAExecutionProvider
2022-12-21 18:33:24.851158269 [I:onnxruntime:, session_state.cc:31 SetupAllocators] Allocator already registered for OrtMemoryInfo:[name:CUDA_CPU id:0 OrtMemType:-2 OrtAllocatorType:1 Device:[DeviceType:0 MemoryType:0 DeviceId:0]]. Ignoring allocator from CUDAExecutionProvider
2022-12-21 18:33:24.857593951 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
2022-12-21 18:33:24.858954698 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
2022-12-21 18:33:27.940314804 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-21 18:33:27 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
2022-12-21 18:33:28.178308432 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-21 18:33:28 WARNING] external/onnx-tensorrt/onnx2trt_utils.cpp:367: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
2022-12-21 18:33:31.392474189 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-21 18:33:31 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
2022-12-21 18:33:32.058666574 [I:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer ‘/distilbert/transformer/layer.5/output_layer_norm/Constant_1_output_0’. It is no longer used by any node.
2022-12-21 18:33:32.058691880 [I:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer ‘/distilbert/transformer/layer.5/ffn/activation/Constant_2_output_0’. It is no longer used by any node.
2022-12-21 18:33:32.064914429 [V:onnxruntime:, session_state.cc:1010 VerifyEachNodeIsAssignedToAnEp] Node placements
2022-12-21 18:33:32.064928932 [V:onnxruntime:, session_state.cc:1013 VerifyEachNodeIsAssignedToAnEp] All nodes placed on [TensorrtExecutionProvider]. Number of nodes: 1
2022-12-21 18:33:32.064940012 [V:onnxruntime:, session_state.cc:66 CreateGraphInfo] SaveMLValueNameIndexMapping
2022-12-21 18:33:32.064947366 [V:onnxruntime:, session_state.cc:112 CreateGraphInfo] Done saving OrtValue mappings.
2022-12-21 18:33:32.064966048 [I:onnxruntime:, session_state_utils.cc:199 SaveInitializedTensors] Saving initialized tensors.
2022-12-21 18:33:32.064972535 [I:onnxruntime:, session_state_utils.cc:342 SaveInitializedTensors] Done saving initialized tensors
2022-12-21 18:33:32.065008794 [I:onnxruntime:, inference_session.cc:1488 Initialize] Session successfully initialized.
There is no need to do IO binding for TensorrtExecutionProvider, use_io_binding is set to False.

Thank you for the report!
For the inference latency issue, have you tried checking whether all nodes are effectively placed on the TRT execution provider? You can verify this following https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#use-cuda-execution-provider-with-floatingpoint-models , something like:
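A minimal sketch of such a check, assuming the local export path from the reproduction above: enabling verbose ONNX Runtime logging via session options makes the per-node placement messages visible, so any fallback to the CPU execution provider shows up.

import onnxruntime
from optimum.onnxruntime import ORTModelForCausalLM

# Sketch: verbose session logging prints which execution provider each node is assigned to.
session_options = onnxruntime.SessionOptions()
session_options.log_severity_level = 0  # 0 = verbose

model = ORTModelForCausalLM.from_pretrained(
    "./codegen-6B-mono/onnx/",          # path taken from the reproduction above
    provider="TensorrtExecutionProvider",
    session_options=session_options,
    use_cache=True,
)
# In the logs, look for the VerifyEachNodeIsAssignedToAnEp lines: nodes assigned to
# CPUExecutionProvider instead of TensorrtExecutionProvider usually explain a slowdown.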
For debugging purposes, I'd recommend using a smaller model, such as hf-internal-testing/tiny-random-codegen or similar.

Actually, codegen with causal-lm-with-past is commented out in https://github.com/huggingface/optimum/blob/178728b2b057c446f44acd4b0d122d5259733cb0/optimum/exporters/tasks.py#L243 , so the export with use_cache is not tested. cc @michaelbenayoun do you have an idea why? It seems to stem from https://github.com/huggingface/optimum/pull/403
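A minimal sketch of that debugging setup, assuming hf-internal-testing/tiny-random-codegen and use_cache=False to reflect the disabled with-past export; the prompt and generation arguments are placeholders.

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Sketch: export and load the tiny test checkpoint on the fly, without the with-past decoder.
model = ORTModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-codegen",
    from_transformers=True,
    provider="TensorrtExecutionProvider",
    use_cache=False,  # causal-lm-with-past export for codegen is currently commented out
)
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-codegen")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("def add(a, b):", max_new_tokens=16))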