
Inference worse with onnxruntime-gpu than native pytorch for seq2seq model

See original GitHub issue

System Info

Optimum: 1.4.1.dev0
torch: 1.12.1+cu116
onnx: 1.12.0
onnxruntime-gpu: 1.12.1
python: 3.8.13
CUDA: 11.6
cudnn: 8.4.1
GPU: RTX 3090

Who can help?

@JingyaHuang @echarlaix

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I compared GPU inference of the native torch Helsinki-NLP/opus-mt-fr-en model against the ONNX model optimized with the Optimum library. To do so, I defined a FastAPI microservice based on the two classes below, for torch and optimized ONNX on GPU, respectively:

# Imports needed to make the snippet self-contained; PredictionInput / PredictionOutput
# are the service's Pydantic request/response models (not shown here).
from pathlib import Path
from typing import Optional

from fastapi import Depends, FastAPI
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, MarianMTModel, MarianTokenizer

class Seq2SeqModel:
    tokenizer: Optional[MarianTokenizer]
    model: Optional[MarianMTModel]

    def load_model(self):
        """Loads the model"""
        # model_id="Helsinki-NLP/opus-mt-fr-en"
        model_path = Path("./app/artifacts/HF")
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to("cuda")
        self.tokenizer = tokenizer
        self.model = model

    async def predict(self, input: PredictionInput) -> PredictionOutput:
        """Runs a prediction"""
        if not self.tokenizer or not self.model:
            raise RuntimeError("Model is not loaded")
        tokens = self.tokenizer(input.text, return_tensors="pt").to("cuda")
        translated = self.model.generate(**tokens, num_beams=beam_size)
        return PredictionOutput(translated_text=self.tokenizer.decode(translated[0], skip_special_tokens=True))

class OnnxOptimizedSeq2SeqModel:
    tokenizer: Optional[MarianTokenizer]
    model: Optional[ORTModelForSeq2SeqLM]

    def load_model(self):
        """Loads the model"""
        # model_id="Helsinki-NLP/opus-mt-fr-en"
        onnx_path = Path("./app/artifacts/OL_1")
        tokenizer = AutoTokenizer.from_pretrained(onnx_path)
        optimized_model = ORTModelForSeq2SeqLM.from_pretrained(
            onnx_path,
            encoder_file_name="encoder_model_optimized.onnx",
            decoder_file_name="decoder_model_optimized.onnx",
            decoder_file_with_past_name="decoder_with_past_model_optimized.onnx",
            provider="CUDAExecutionProvider"
        )
        self.tokenizer = tokenizer
        self.model = optimized_model
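    # NOTE: a predict() method analogous to Seq2SeqModel.predict is assumed here
    # (the /prediction_onnx_optimized route below depends on it) but is not shown in the snippet.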

app = FastAPI()
seq2seq_model = Seq2SeqModel()
onnx_optimized_seq2seq_model = OnnxOptimizedSeq2SeqModel()
beam_size = 3

@app.on_event("startup")
async def startup():
    seq2seq_model.load_model()
    onnx_optimized_seq2seq_model.load_model()

@app.post("/prediction")
async def prediction(
    output: PredictionOutput = Depends(seq2seq_model.predict),
) -> PredictionOutput:
    return output

@app.post("/prediction_onnx_optimized")
async def prediction_onnx_optimized(
    output: PredictionOutput = Depends(onnx_optimized_seq2seq_model.predict),
) -> PredictionOutput:
    return output
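
For completeness, the load test can be approximated with a small client that times repeated requests against both endpoints. This is only a sketch, assuming the service is reachable at http://localhost:8000 and that PredictionInput exposes a single text field:

import time

import requests  # assumed available in the test environment

TEXT = "Le machine learning est une branche de l'intelligence artificielle."
ENDPOINTS = {
    "torch": "http://localhost:8000/prediction",
    "onnx_optimized": "http://localhost:8000/prediction_onnx_optimized",
}

def bench(url: str, n: int = 50, warmup: int = 5) -> float:
    """Returns the average latency in seconds over n requests, after a short warm-up."""
    for _ in range(warmup):
        requests.post(url, json={"text": TEXT}).raise_for_status()
    start = time.perf_counter()
    for _ in range(n):
        requests.post(url, json={"text": TEXT}).raise_for_status()
    return (time.perf_counter() - start) / n

for name, url in ENDPOINTS.items():
    print(f"{name}: {bench(url):.4f} s / request")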

Expected behavior

When load testing the service on my local computer, I was surprised by two things:

  1. The GPU performance of the optimized ONNX model is worse than that of native torch (maybe linked to #365 and #396?):

[Load-test latency charts: GPU_optimized_onnxruntime vs. GPU_torch]

  2. When running this FastAPI service in a docker image, I got the following warning:

2022-09-28 08:20:21.214094612 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:566 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.

Does this mean the CUDAExecutionProvider is not working, even though I set it explicitly here?

        optimized_model = ORTModelForSeq2SeqLM.from_pretrained(
            onnx_path,
            encoder_file_name="encoder_model_optimized.onnx",
            decoder_file_name="decoder_model_optimized.onnx",
            decoder_file_with_past_name="decoder_with_past_model_optimized.onnx",
            provider="CUDAExecutionProvider"
        )

What could have caused that? I saw at https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html that CUDA 11.6 is not mentioned; could that be the reason?
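
One way to confirm whether the CUDAExecutionProvider was actually created inside the container is to query ONNX Runtime directly. A minimal sketch, with the encoder path taken from the snippet above as an assumption:

import onnxruntime as ort

# What the installed onnxruntime build supports at all
print(ort.get_device())               # "GPU" for onnxruntime-gpu builds
print(ort.get_available_providers())  # should include "CUDAExecutionProvider"

# What a real session actually ended up using
sess = ort.InferenceSession(
    "./app/artifacts/OL_1/encoder_model_optimized.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())           # no CUDA here means the session fell back to CPU

If the session only reports CPUExecutionProvider, the warning above means inference silently fell back to CPU, typically because the CUDA/cuDNN runtime libraries are not visible inside the image (for example a non-CUDA base image or a missing library path).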

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

1 reaction
JingyaHuang commented, Nov 10, 2022

Hi @soocheolnoh

The fix has been done; there was a bug in the output population, thanks for pointing it out.

You should now get the same translation result with or without IOBinding.

# Transcript w/ IOBinding
['Machine learning is a branch of artificial intelligence. The history of artificial intelligence has a natural, clear line from "rationality" to "knowledge" to "learning" which is clearly a way to achieve artificial intelligence. Machine learning has been developed for nearly 30 years as a cross-disciplinary discipline involving probability theory, statistics, proximity theory, convex analysis, and computational complexity theory. Machine learning theory primarily designs and analyzes algorithms that allow computers to "learn" automatically.']
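
For anyone who wants to reproduce this check, the comparison can be sketched roughly as below. The use_io_binding flag, the model id and the generation settings are assumptions here, and the exact from_pretrained arguments depend on the Optimum version:

from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model_id = "Helsinki-NLP/opus-mt-fr-en"  # assumption: the model from the original report
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(
    "Le machine learning est une branche de l'intelligence artificielle.",
    return_tensors="pt",
).to("cuda")

transcripts = {}
for use_io_binding in (True, False):
    model = ORTModelForSeq2SeqLM.from_pretrained(
        model_id,
        from_transformers=True,           # export to ONNX on the fly
        provider="CUDAExecutionProvider",
        use_io_binding=use_io_binding,
    )
    generated = model.generate(**inputs, num_beams=5, max_length=256)
    transcripts[use_io_binding] = tokenizer.decode(generated[0], skip_special_tokens=True)

# After the fix, both transcripts should be identical.
print(transcripts[True] == transcripts[False])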

Also, sharing some performance numbers tested with the previous snippet (PyTorch vs. Optimum, T4, warm_up_steps=10, loop=100, num_beam=5, max_length=256):

                             PyTorch     Optimum (w/ IO)
total (s)                    260.7315    145.2276
loop                         100         100
avg (s) / seq                2.6073      1.4523
throughput (translation/s)   0.3835      0.6886
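
For context, avg (s) / seq is total (s) divided by loop, and throughput is its reciprocal, so these numbers correspond to roughly a 1.8x speedup (2.6073 s vs. 1.4523 s per sequence) for the IOBinding path on this setup.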

The issue is closed, but feel free to reopen it or ping me if you have extra questions about IOBinding. @soocheolnoh @Matthieu-Tinycoaching Thanks again for helping us improve Optimum.

1 reaction
JingyaHuang commented, Nov 9, 2022

Hi @soocheolnoh, thanks for testing.

On my side, for the mt5 model, the generated results differ with vs. without IO binding, which is not normal: IOBinding is not supposed to change the result (only where the data is placed should differ). It might be a bug in the post-processing of the outputs. I will take a closer look and fix the beam search ASAP.
