Inference worse with onnxruntime-gpu than native pytorch for seq2seq model
System Info
Optimum: 1.4.1.dev0
torch: 1.12.1+cu116
onnx: 1.12.0
onnxruntime-gpu: 1.12.1
python: 3.8.13
CUDA: 11.6
cudnn: 8.4.1
RTX 3090
Who can help?
Information
- The official example scripts
- My own modified scripts

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I compared GPU inference of the native PyTorch Helsinki-NLP/opus-mt-fr-en model against the ONNX model optimized with the Optimum library. To do so, I defined a FastAPI microservice based on the two classes below, for the PyTorch and the optimized ONNX model on GPU, respectively:
from pathlib import Path
from typing import Optional

from fastapi import Depends, FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, MarianMTModel, MarianTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# PredictionInput / PredictionOutput are the service's own Pydantic request/response
# models (their definitions are not shown here).


class Seq2SeqModel:
    tokenizer: Optional[MarianTokenizer]
    model: Optional[MarianMTModel]

    def load_model(self):
        """Loads the model"""
        # model_id="Helsinki-NLP/opus-mt-fr-en"
        model_path = Path("./app/artifacts/HF")
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to("cuda")
        self.tokenizer = tokenizer
        self.model = model

    async def predict(self, input: PredictionInput) -> PredictionOutput:
        """Runs a prediction"""
        if not self.tokenizer or not self.model:
            raise RuntimeError("Model is not loaded")
        tokens = self.tokenizer(input.text, return_tensors="pt").to("cuda")
        translated = self.model.generate(**tokens, num_beams=beam_size)
        return PredictionOutput(translated_text=self.tokenizer.decode(translated[0], skip_special_tokens=True))
class OnnxOptimizedSeq2SeqModel:
    tokenizer: Optional[MarianTokenizer]
    model: Optional[ORTModelForSeq2SeqLM]

    def load_model(self):
        """Loads the model"""
        # model_id="Helsinki-NLP/opus-mt-fr-en"
        onnx_path = Path("./app/artifacts/OL_1")
        tokenizer = AutoTokenizer.from_pretrained(onnx_path)
        optimized_model = ORTModelForSeq2SeqLM.from_pretrained(
            onnx_path,
            encoder_file_name="encoder_model_optimized.onnx",
            decoder_file_name="decoder_model_optimized.onnx",
            decoder_file_with_past_name="decoder_with_past_model_optimized.onnx",
            provider="CUDAExecutionProvider",
        )
        self.tokenizer = tokenizer
        self.model = optimized_model
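
    # NOTE: the original snippet did not include a predict method for this class,
    # although the /prediction_onnx_optimized route below depends on it. The method
    # below is a hypothetical reconstruction mirroring Seq2SeqModel.predict;
    # ORTModelForSeq2SeqLM exposes the same generate() API as the PyTorch model.
    async def predict(self, input: PredictionInput) -> PredictionOutput:
        """Runs a prediction with the optimized ONNX model"""
        if not self.tokenizer or not self.model:
            raise RuntimeError("Model is not loaded")
        tokens = self.tokenizer(input.text, return_tensors="pt")
        translated = self.model.generate(**tokens, num_beams=beam_size)
        return PredictionOutput(translated_text=self.tokenizer.decode(translated[0], skip_special_tokens=True))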
app = FastAPI()
seq2seq_model = Seq2SeqModel()
onnx_optimized_seq2seq_model = OnnxOptimizedSeq2SeqModel()
beam_size = 3


@app.on_event("startup")
async def startup():
    seq2seq_model.load_model()
    onnx_optimized_seq2seq_model.load_model()


@app.post("/prediction")
async def prediction(
    output: PredictionOutput = Depends(seq2seq_model.predict),
) -> PredictionOutput:
    return output


@app.post("/prediction_onnx_optimized")
async def prediction_onnx_optimized(
    output: PredictionOutput = Depends(onnx_optimized_seq2seq_model.predict),
) -> PredictionOutput:
    return output
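For reference, the encoder_model_optimized.onnx / decoder_model_optimized.onnx / decoder_with_past_model_optimized.onnx files loaded above have to be produced beforehand with Optimum's ORTOptimizer. The exact export and optimization settings are not given in the issue, so the sketch below (optimization level, save directory) is only an assumption based on the Optimum 1.4 API:

    from pathlib import Path

    from transformers import AutoTokenizer
    from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTOptimizer
    from optimum.onnxruntime.configuration import OptimizationConfig

    model_id = "Helsinki-NLP/opus-mt-fr-en"
    save_dir = Path("./app/artifacts/OL_1")

    # Export the PyTorch checkpoint to ONNX (encoder, decoder, decoder with past).
    model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)

    # Run ONNX Runtime graph optimizations and save the *_optimized.onnx files.
    optimizer = ORTOptimizer.from_pretrained(model)
    optimization_config = OptimizationConfig(optimization_level=2)  # the level is an assumption
    optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)

    # Save the tokenizer next to the ONNX files so AutoTokenizer.from_pretrained(onnx_path) works.
    AutoTokenizer.from_pretrained(model_id).save_pretrained(save_dir)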
Expected behavior
When load testing the model on my local computer, I was surprised by two things:
- The GPU performance of the optimized ONNX model is worse than that of the native PyTorch model (maybe linked to #365 and #396?):
- When running this FastAPI service inside a Docker image, I got the following warning:
2022-09-28 08:20:21.214094612 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:566 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
Does this mean the CUDAExecutionProvider is not working, even though I set it explicitly here?
optimized_model = ORTModelForSeq2SeqLM.from_pretrained(
    onnx_path,
    encoder_file_name="encoder_model_optimized.onnx",
    decoder_file_name="decoder_model_optimized.onnx",
    decoder_file_with_past_name="decoder_with_past_model_optimized.onnx",
    provider="CUDAExecutionProvider",
)
What could have caused that? I saw at https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html that CUDA 11.6 is not mentioned; could that be the reason?
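One way to check whether the CUDAExecutionProvider can actually be created inside the container is to query onnxruntime directly. A minimal diagnostic sketch (the model path simply reuses the artifact location from the snippet above):

    import onnxruntime as ort

    print(ort.get_device())                # "GPU" if the GPU build is installed
    print(ort.get_available_providers())   # should include "CUDAExecutionProvider"

    # Creating a session surfaces the same warning seen in the logs if the CUDA/cuDNN
    # libraries are missing or incompatible; get_providers() shows what was actually used.
    sess = ort.InferenceSession(
        "./app/artifacts/OL_1/encoder_model_optimized.onnx",
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    print(sess.get_providers())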
Top GitHub Comments
Hi @soocheolnoh,
The fix has been done; there was a bug in the output population, thanks for pointing it out. You should now get the same translation result with or without IOBinding.
Also sharing some performance numbers tested with the previous snippet (PyTorch vs. Optimum, T4, warm_up_steps=10, loop=100, num_beam=5, max_length=256).
The issue is closed, but feel free to reopen it or ping me if you have extra questions about IOBinding. @soocheolnoh @Matthieu-Tinycoaching, thanks again for helping us improve Optimum.
Hi @soocheolnoh, thanks for testing.
From my side, for the mt5 model the generated results differ with vs. without IO binding, which is not normal: IO binding is not supposed to change the results, only where the data is placed. It might be a bug in the post-processing of the outputs. I will take a closer look and fix the beam search ASAP.
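For anyone who wants to reproduce such a comparison, here is a minimal sketch that generates with and without IO binding. It assumes the use_io_binding argument of ORTModelForSeq2SeqLM.from_pretrained (available in recent Optimum releases); the model id and input sentence are placeholders, and tensor placement may need adjusting depending on the Optimum version:

    from transformers import AutoTokenizer
    from optimum.onnxruntime import ORTModelForSeq2SeqLM

    model_id = "Helsinki-NLP/opus-mt-fr-en"  # placeholder; the comment above refers to an mt5 model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("Bonjour, comment allez-vous ?", return_tensors="pt").to("cuda")

    for use_io_binding in (True, False):
        model = ORTModelForSeq2SeqLM.from_pretrained(
            model_id,
            from_transformers=True,            # export the checkpoint to ONNX on the fly
            provider="CUDAExecutionProvider",
            use_io_binding=use_io_binding,
        )
        output_ids = model.generate(**inputs, num_beams=5, max_length=256)
        # Once the fix is released, both iterations should print the same translation.
        print(use_io_binding, tokenizer.decode(output_ids[0], skip_special_tokens=True))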