[BUG] Using TensorRT slows down inference
System Info
A100-80G
torch 1.11.0+cu113
optimum: built from this PR: https://github.com/huggingface/optimum/pull/586
onnx 1.13.0
onnx-graphsurgeon 0.3.25
onnxconverter-common 1.13.0
onnxoptimizer 0.3.2
onnxruntime 1.13.1
onnxruntime-gpu 1.12.1
onnxruntime-tools 1.7.0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
First, I store the ONNX files locally; the path is /opt/tiger/genius/checkpoints/codegen-6B-mono/onnx/.
My code is:
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "Salesforce/codegen-6B-mono",
    from_transformers=True,
    provider="TensorrtExecutionProvider",
)
model.save_pretrained("./codegen-6B-mono/onnx")
The directory structure is:
codegen-6B-mono/onnx
  - config.json
  - decoder_model.onnx
  - decoder_model.onnx_data
  - decoder_with_past_model.onnx
  - decoder_with_past_model.onnx_data
I follow your PR (https://github.com/huggingface/optimum/pull/421) and write the code below:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/opt/tiger/genius/preprocess/codegen-6B-mono", use_fast=True)
model = ORTModelForCausalLM.from_pretrained(
    "./codegen-6B-mono/onnx/",
    provider="TensorrtExecutionProvider",
    use_cache=True,
    use_io_binding=True,
    decoder_file_name="decoder_model.onnx",
    decoder_with_past_file_name="decoder_with_past_model.onnx",
)
model.save_pretrained("/opt/tiger/genius/checkpoints/codegen-6B-mono-onnx-v2")
Then I run inference:
import os
from datetime import datetime
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
generator = pipeline("text-generation", model=model, device=local_rank, tokenizer=tokenizer)
p0 = datetime.utcnow()
sample = generator("func FormatInt(i int64, base int) string{ ", pad_token_id=tokenizer.eos_token_id,
                   max_new_tokens=128, max_time=20.0, do_sample=True, temperature=0.8, top_p=0.95,
                   use_cache=True, num_return_sequences=1)
p1 = datetime.utcnow()
print(f"Time difference is {(p1 - p0).total_seconds()} seconds")
print(sample)
Expected behavior
Generation takes 21 s, which is slower than the 4 s I get with the plain Hugging Face generate. Also, loading the model takes around 1 hour. This is not acceptable.
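A hedged sketch, not part of the original report: much of that load time with the TensorRT execution provider typically goes into building the TensorRT engine at session creation. ONNX Runtime's TensorRT provider options trt_engine_cache_enable and trt_engine_cache_path allow the built engine to be cached on disk and reused; the cache directory below is an arbitrary example, and this assumes the installed Optimum version exposes provider_options in from_pretrained.

from optimum.onnxruntime import ORTModelForCausalLM

# Sketch: cache the TensorRT engine on disk so it is not rebuilt on every load.
# "./trt_cache" is an assumed path; adjust as needed.
provider_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "./trt_cache",
}
model = ORTModelForCausalLM.from_pretrained(
    "./codegen-6B-mono/onnx/",
    provider="TensorrtExecutionProvider",
    provider_options=provider_options,
    use_cache=True,
    use_io_binding=True,
)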
Top GitHub Comments
My log:

2022-12-21 18:33:23.301532915 [I:onnxruntime:, inference_session.cc:263 operator()] Flush-to-zero and denormal-as-zero are off
2022-12-21 18:33:23.301563968 [I:onnxruntime:, inference_session.cc:271 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2022-12-21 18:33:23.301572917 [I:onnxruntime:, inference_session.cc:292 ConstructorCommon] Dynamic block base set to 0
2022-12-21 18:33:24.851080872 [I:onnxruntime:, inference_session.cc:1222 Initialize] Initializing session.
2022-12-21 18:33:24.851107028 [I:onnxruntime:, inference_session.cc:1259 Initialize] Adding default CPU execution provider.
2022-12-21 18:33:24.851140837 [I:onnxruntime:, session_state.cc:31 SetupAllocators] Allocator already registered for OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]. Ignoring allocator from CUDAExecutionProvider
2022-12-21 18:33:24.851150630 [I:onnxruntime:, session_state.cc:31 SetupAllocators] Allocator already registered for OrtMemoryInfo:[name:CudaPinned id:0 OrtMemType:-1 OrtAllocatorType:1 Device:[DeviceType:0 MemoryType:1 DeviceId:0]]. Ignoring allocator from CUDAExecutionProvider
2022-12-21 18:33:24.851158269 [I:onnxruntime:, session_state.cc:31 SetupAllocators] Allocator already registered for OrtMemoryInfo:[name:CUDA_CPU id:0 OrtMemType:-2 OrtAllocatorType:1 Device:[DeviceType:0 MemoryType:0 DeviceId:0]]. Ignoring allocator from CUDAExecutionProvider
2022-12-21 18:33:24.857593951 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
2022-12-21 18:33:24.858954698 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
2022-12-21 18:33:27.940314804 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-21 18:33:27 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
2022-12-21 18:33:28.178308432 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-21 18:33:28 WARNING] external/onnx-tensorrt/onnx2trt_utils.cpp:367: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
2022-12-21 18:33:31.392474189 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-21 18:33:31 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
2022-12-21 18:33:32.058666574 [I:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer ‘/distilbert/transformer/layer.5/output_layer_norm/Constant_1_output_0’. It is no longer used by any node.
2022-12-21 18:33:32.058691880 [I:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer ‘/distilbert/transformer/layer.5/ffn/activation/Constant_2_output_0’. It is no longer used by any node.
2022-12-21 18:33:32.064914429 [V:onnxruntime:, session_state.cc:1010 VerifyEachNodeIsAssignedToAnEp] Node placements
2022-12-21 18:33:32.064928932 [V:onnxruntime:, session_state.cc:1013 VerifyEachNodeIsAssignedToAnEp] All nodes placed on [TensorrtExecutionProvider]. Number of nodes: 1
2022-12-21 18:33:32.064940012 [V:onnxruntime:, session_state.cc:66 CreateGraphInfo] SaveMLValueNameIndexMapping
2022-12-21 18:33:32.064947366 [V:onnxruntime:, session_state.cc:112 CreateGraphInfo] Done saving OrtValue mappings.
2022-12-21 18:33:32.064966048 [I:onnxruntime:, session_state_utils.cc:199 SaveInitializedTensors] Saving initialized tensors.
2022-12-21 18:33:32.064972535 [I:onnxruntime:, session_state_utils.cc:342 SaveInitializedTensors] Done saving initialized tensors
2022-12-21 18:33:32.065008794 [I:onnxruntime:, inference_session.cc:1488 Initialize] Session successfully initialized.
There is no need to do IO binding for TensorrtExecutionProvider, use_io_binding is set to False.

Thank you for the report!
For the inference latency issue, have you tried checking whether all nodes are effectively placed on the TRT execution provider? You can verify this following https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#use-cuda-execution-provider-with-floatingpoint-models , something like:
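A minimal sketch of such a check, assuming the local export path from the reproduction above: enabling verbose ONNX Runtime logging via session options makes the per-node placement messages visible, so any fallback to the CPU execution provider shows up.

import onnxruntime
from optimum.onnxruntime import ORTModelForCausalLM

# Sketch: verbose session logging prints which execution provider each node is assigned to.
session_options = onnxruntime.SessionOptions()
session_options.log_severity_level = 0  # 0 = verbose

model = ORTModelForCausalLM.from_pretrained(
    "./codegen-6B-mono/onnx/",          # path taken from the reproduction above
    provider="TensorrtExecutionProvider",
    session_options=session_options,
    use_cache=True,
)
# In the logs, look for the VerifyEachNodeIsAssignedToAnEp lines: nodes assigned to
# CPUExecutionProvider instead of TensorrtExecutionProvider usually explain a slowdown.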
For debugging purposes, I'd recommend using a smaller model, such as hf-internal-testing/tiny-random-codegen or similar.

Actually, codegen with causal-lm-with-past is commented out in https://github.com/huggingface/optimum/blob/178728b2b057c446f44acd4b0d122d5259733cb0/optimum/exporters/tasks.py#L243 , so the export with use_cache is not tested. cc @michaelbenayoun do you have an idea why? It seems to stem from https://github.com/huggingface/optimum/pull/403
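A minimal sketch of that debugging setup, assuming hf-internal-testing/tiny-random-codegen and use_cache=False to reflect the disabled with-past export; the prompt and generation arguments are placeholders.

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Sketch: export and load the tiny test checkpoint on the fly, without the with-past decoder.
model = ORTModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-codegen",
    from_transformers=True,
    provider="TensorrtExecutionProvider",
    use_cache=False,  # causal-lm-with-past export for codegen is currently commented out
)
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-codegen")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("def add(a, b):", max_new_tokens=16))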