
[BUG] Using TensorRT slows down the inference


System Info

A100-80G
torch                        1.11.0+cu113
optimum                      built from PR https://github.com/huggingface/optimum/pull/586
onnx                         1.13.0
onnx-graphsurgeon            0.3.25
onnxconverter-common         1.13.0
onnxoptimizer                0.3.2
onnxruntime                  1.13.1
onnxruntime-gpu              1.12.1
onnxruntime-tools            1.7.0
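
Since both onnxruntime and onnxruntime-gpu appear in this environment, a quick sanity check that the TensorRT execution provider is actually visible to the ONNX Runtime build being imported can be useful. This snippet is only an illustrative sketch, not part of the original report:

import onnxruntime

# "TensorrtExecutionProvider" must appear in this list for the TRT provider to be usable.
print(onnxruntime.__version__)
print(onnxruntime.get_available_providers())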

Who can help?

@JingyaHuang @NouamaneTazi

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

First, I store the ONNX files locally; the path is /opt/tiger/genius/checkpoints/codegen-6B-mono/onnx/.

My code is:

from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "Salesforce/codegen-6B-mono",
    from_transformers=True,
    provider="TensorrtExecutionProvider",
)
model.save_pretrained("./codegen-6B-mono/onnx")

The directory structure is:

codegen-6B-mono/onnx
    config.json
    decoder_model.onnx
    decoder_model.onnx_data
    decoder_with_past_model.onnx
    decoder_with_past_model.onnx_data

I followed your PR (https://github.com/huggingface/optimum/pull/421) and wrote the code below:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/opt/tiger/genius/preprocess/codegen-6B-mono", fast_tokenizer=True)
model = ORTModelForCausalLM.from_pretrained(
    "./codegen-6B-mono/onnx/",
    provider="TensorrtExecutionProvider",
    use_cache=True,
    use_io_binding=True,
    decoder_file_name="./codegen-6B-mono/onnx/decoder_model.onnx",
    decoder_with_past_file_name="./codegen-6B-mono/onnx/decoder_with_past_model.onnx",
)
model.save_pretrained("/opt/tiger/genius/checkpoints/codegen-6B-mono-onnx-v2")

Inference:

import os
from datetime import datetime
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
generator = pipeline("text-generation", model=model, device=local_rank, tokenizer=tokenizer)
p0 = datetime.utcnow()
sample = generator("func FormatInt(i int64, base int) string{      ",
                   pad_token_id=tokenizer.eos_token_id, max_new_tokens=128, max_time=20.0,
                   do_sample=True, temperature=0.8, top_p=0.95, use_cache=True,
                   num_return_sequences=1)
p1 = datetime.utcnow()
print(f"Time difference is {(p1 - p0).total_seconds()} seconds")
print(sample)

Expected behavior

Generation takes 21 s, which is slower than the 4 s it takes with the plain Hugging Face generate. Also, loading the model takes around 1 hour. This is not acceptable.
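
For reference, a minimal sketch of how the two numbers might be compared like-for-like, timing the same prompt and generation arguments on a plain PyTorch pipeline and on the ORT/TensorRT pipeline from the reproduction above. The device and dtype choices here are assumptions for illustration, not taken from the original report:

import time
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

prompt = "func FormatInt(i int64, base int) string{      "

# Plain PyTorch baseline (device choice is an assumption).
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-6B-mono")
pt_model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-6B-mono")
pt_generator = pipeline("text-generation", model=pt_model, tokenizer=tokenizer, device=0)

start = time.perf_counter()
pt_generator(prompt, pad_token_id=tokenizer.eos_token_id, max_new_tokens=128,
             do_sample=True, temperature=0.8, top_p=0.95, num_return_sequences=1)
print(f"PyTorch pipeline: {time.perf_counter() - start:.1f} s")

# The ORT/TensorRT `generator` built above can be timed the same way with the
# same prompt and generation arguments for a direct comparison.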

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
PoodleWang commented, Dec 21, 2022

causal-lm-with-past

My log:

2022-12-21 18:33:23.301532915 [I:onnxruntime:, inference_session.cc:263 operator()] Flush-to-zero and denormal-as-zero are off
2022-12-21 18:33:23.301563968 [I:onnxruntime:, inference_session.cc:271 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2022-12-21 18:33:23.301572917 [I:onnxruntime:, inference_session.cc:292 ConstructorCommon] Dynamic block base set to 0
2022-12-21 18:33:24.851080872 [I:onnxruntime:, inference_session.cc:1222 Initialize] Initializing session.
2022-12-21 18:33:24.851107028 [I:onnxruntime:, inference_session.cc:1259 Initialize] Adding default CPU execution provider.
2022-12-21 18:33:24.851140837 [I:onnxruntime:, session_state.cc:31 SetupAllocators] Allocator already registered for OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]. Ignoring allocator from CUDAExecutionProvider
2022-12-21 18:33:24.851150630 [I:onnxruntime:, session_state.cc:31 SetupAllocators] Allocator already registered for OrtMemoryInfo:[name:CudaPinned id:0 OrtMemType:-1 OrtAllocatorType:1 Device:[DeviceType:0 MemoryType:1 DeviceId:0]]. Ignoring allocator from CUDAExecutionProvider
2022-12-21 18:33:24.851158269 [I:onnxruntime:, session_state.cc:31 SetupAllocators] Allocator already registered for OrtMemoryInfo:[name:CUDA_CPU id:0 OrtMemType:-2 OrtAllocatorType:1 Device:[DeviceType:0 MemoryType:0 DeviceId:0]]. Ignoring allocator from CUDAExecutionProvider
2022-12-21 18:33:24.857593951 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
2022-12-21 18:33:24.858954698 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
2022-12-21 18:33:27.940314804 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-21 18:33:27 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
2022-12-21 18:33:28.178308432 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-21 18:33:28 WARNING] external/onnx-tensorrt/onnx2trt_utils.cpp:367: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
2022-12-21 18:33:31.392474189 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-21 18:33:31 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
2022-12-21 18:33:32.058666574 [I:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/distilbert/transformer/layer.5/output_layer_norm/Constant_1_output_0'. It is no longer used by any node.
2022-12-21 18:33:32.058691880 [I:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/distilbert/transformer/layer.5/ffn/activation/Constant_2_output_0'. It is no longer used by any node.

2022-12-21 18:33:32.064914429 [V:onnxruntime:, session_state.cc:1010 VerifyEachNodeIsAssignedToAnEp] Node placements
2022-12-21 18:33:32.064928932 [V:onnxruntime:, session_state.cc:1013 VerifyEachNodeIsAssignedToAnEp] All nodes placed on [TensorrtExecutionProvider]. Number of nodes: 1
2022-12-21 18:33:32.064940012 [V:onnxruntime:, session_state.cc:66 CreateGraphInfo] SaveMLValueNameIndexMapping
2022-12-21 18:33:32.064947366 [V:onnxruntime:, session_state.cc:112 CreateGraphInfo] Done saving OrtValue mappings.
2022-12-21 18:33:32.064966048 [I:onnxruntime:, session_state_utils.cc:199 SaveInitializedTensors] Saving initialized tensors.
2022-12-21 18:33:32.064972535 [I:onnxruntime:, session_state_utils.cc:342 SaveInitializedTensors] Done saving initialized tensors
2022-12-21 18:33:32.065008794 [I:onnxruntime:, inference_session.cc:1488 Initialize] Session successfully initialized.
There is no need to do IO binding for TensorrtExecutionProvider, use_io_binding is set to False.

0 reactions
fxmarty commented, Dec 21, 2022

Thank you for the report!

For the inference latency issue, have you checked whether all nodes are effectively placed on the TRT execution provider? You can verify this by following https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#use-cuda-execution-provider-with-floatingpoint-models, with something like:

import onnxruntime
from optimum.onnxruntime import ORTModelForSequenceClassification

session_options = onnxruntime.SessionOptions()
session_options.log_severity_level = 0

ort_model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    from_transformers=True,
    provider="TensorrtExecutionProvider",
    session_options=session_options,
)

For debugging purposes, I'd recommend using a smaller model, such as hf-internal-testing/tiny-random-codegen or similar.
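
For instance, the same verbose-session check can be applied to a small causal LM. This is only a sketch: the tiny test model and the keyword arguments are the ones used above, everything else is illustrative:

import onnxruntime
from optimum.onnxruntime import ORTModelForCausalLM

session_options = onnxruntime.SessionOptions()
session_options.log_severity_level = 0  # verbose: logs node placement per execution provider

# Small test model so the export and the TensorRT engine build finish quickly.
ort_model = ORTModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-codegen",
    from_transformers=True,
    provider="TensorrtExecutionProvider",
    session_options=session_options,
)
# Grep the verbose output for "VerifyEachNodeIsAssignedToAnEp" to see whether any
# nodes fall back to the CUDA or CPU execution providers.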

Actually, codegen with causal-lm-with-past is commented out in https://github.com/huggingface/optimum/blob/178728b2b057c446f44acd4b0d122d5259733cb0/optimum/exporters/tasks.py#L243, so the export with use_cache is not tested. cc @michaelbenayoun, do you have an idea why? It seems to stem from https://github.com/huggingface/optimum/pull/403
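
One way to see which export tasks are currently registered for the codegen architecture is to query the exporters' task registry. A sketch only, assuming this optimum revision exposes TasksManager.get_supported_tasks_for_model_type:

from optimum.exporters.tasks import TasksManager

# Lists the tasks the ONNX exporter registers for "codegen"; if causal-lm-with-past
# is commented out in tasks.py, it should be missing from this list.
supported = TasksManager.get_supported_tasks_for_model_type("codegen", exporter="onnx")
print(list(supported.keys()))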
