Inference worse with onnxruntime-gpu than native pytorch for seq2seq model
System Info
Optimum: 1.4.1.dev0
torch: 1.12.1+cu116
onnx: 1.12.0
onnxruntime-gpu: 1.12.1
python: 3.8.13
CUDA: 11.6
cudnn: 8.4.1
RTX 3090
Who can help?
Information
- The official example scripts
- My own modified scripts

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I compared GPU inference of the native PyTorch Helsinki-NLP/opus-mt-fr-en model against the ONNX model optimized with the Optimum library. To do so, I defined a FastAPI microservice based on the two classes below, for the PyTorch and the optimized ONNX model on GPU, respectively:
from pathlib import Path
from typing import Optional

from fastapi import Depends, FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, MarianMTModel, MarianTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# PredictionInput / PredictionOutput are the service's own Pydantic request/response
# models (their definitions are not shown here).


class Seq2SeqModel:
    tokenizer: Optional[MarianTokenizer]
    model: Optional[MarianMTModel]

    def load_model(self):
        """Loads the model"""
        # model_id="Helsinki-NLP/opus-mt-fr-en"
        model_path = Path("./app/artifacts/HF")
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to("cuda")
        self.tokenizer = tokenizer
        self.model = model

    async def predict(self, input: PredictionInput) -> PredictionOutput:
        """Runs a prediction"""
        if not self.tokenizer or not self.model:
            raise RuntimeError("Model is not loaded")
        tokens = self.tokenizer(input.text, return_tensors="pt").to("cuda")
        translated = self.model.generate(**tokens, num_beams=beam_size)
        return PredictionOutput(translated_text=self.tokenizer.decode(translated[0], skip_special_tokens=True))
class OnnxOptimizedSeq2SeqModel:
    tokenizer: Optional[MarianTokenizer]
    model: Optional[ORTModelForSeq2SeqLM]

    def load_model(self):
        """Loads the model"""
        # model_id="Helsinki-NLP/opus-mt-fr-en"
        onnx_path = Path("./app/artifacts/OL_1")
        tokenizer = AutoTokenizer.from_pretrained(onnx_path)
        optimized_model = ORTModelForSeq2SeqLM.from_pretrained(
            onnx_path,
            encoder_file_name="encoder_model_optimized.onnx",
            decoder_file_name="decoder_model_optimized.onnx",
            decoder_file_with_past_name="decoder_with_past_model_optimized.onnx",
            provider="CUDAExecutionProvider",
        )
        self.tokenizer = tokenizer
        self.model = optimized_model
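
    # NOTE: the original snippet did not include a predict method for this class,
    # although the /prediction_onnx_optimized route below depends on it. The method
    # below is a hypothetical reconstruction mirroring Seq2SeqModel.predict;
    # ORTModelForSeq2SeqLM exposes the same generate() API as the PyTorch model.
    async def predict(self, input: PredictionInput) -> PredictionOutput:
        """Runs a prediction with the optimized ONNX model"""
        if not self.tokenizer or not self.model:
            raise RuntimeError("Model is not loaded")
        tokens = self.tokenizer(input.text, return_tensors="pt")
        translated = self.model.generate(**tokens, num_beams=beam_size)
        return PredictionOutput(translated_text=self.tokenizer.decode(translated[0], skip_special_tokens=True))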
app = FastAPI()
seq2seq_model = Seq2SeqModel()
onnx_optimized_seq2seq_model = OnnxOptimizedSeq2SeqModel()
beam_size = 3


@app.on_event("startup")
async def startup():
    seq2seq_model.load_model()
    onnx_optimized_seq2seq_model.load_model()


@app.post("/prediction")
async def prediction(
    output: PredictionOutput = Depends(seq2seq_model.predict),
) -> PredictionOutput:
    return output


@app.post("/prediction_onnx_optimized")
async def prediction_onnx_optimized(
    output: PredictionOutput = Depends(onnx_optimized_seq2seq_model.predict),
) -> PredictionOutput:
    return output
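For reference, the encoder_model_optimized.onnx / decoder_model_optimized.onnx / decoder_with_past_model_optimized.onnx files loaded above have to be produced beforehand with Optimum's ORTOptimizer. The exact export and optimization settings are not given in the issue, so the sketch below (optimization level, save directory) is only an assumption based on the Optimum 1.4 API:

    from pathlib import Path

    from transformers import AutoTokenizer
    from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTOptimizer
    from optimum.onnxruntime.configuration import OptimizationConfig

    model_id = "Helsinki-NLP/opus-mt-fr-en"
    save_dir = Path("./app/artifacts/OL_1")

    # Export the PyTorch checkpoint to ONNX (encoder, decoder, decoder with past).
    model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)

    # Run ONNX Runtime graph optimizations and save the *_optimized.onnx files.
    optimizer = ORTOptimizer.from_pretrained(model)
    optimization_config = OptimizationConfig(optimization_level=2)  # the level is an assumption
    optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)

    # Save the tokenizer next to the ONNX files so AutoTokenizer.from_pretrained(onnx_path) works.
    AutoTokenizer.from_pretrained(model_id).save_pretrained(save_dir)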
Expected behavior
When load testing the model on my local computer, I was surprised by two things:
- The GPU performance of the optimized ONNX model is worse than that of the native PyTorch model (maybe linked to #365 and #396?):
- When running this FastAPI service inside a Docker image, I got the following warning:
2022-09-28 08:20:21.214094612 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:566 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
Does this mean the CUDAExecutionProvider is not working, even though I set it explicitly here?
optimized_model = ORTModelForSeq2SeqLM.from_pretrained(
    onnx_path,
    encoder_file_name="encoder_model_optimized.onnx",
    decoder_file_name="decoder_model_optimized.onnx",
    decoder_file_with_past_name="decoder_with_past_model_optimized.onnx",
    provider="CUDAExecutionProvider",
)
What could have caused that? I saw at https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html that CUDA 11.6 is not mentioned; could that be the reason?
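One way to check whether the CUDAExecutionProvider can actually be created inside the container is to query onnxruntime directly. A minimal diagnostic sketch (the model path simply reuses the artifact location from the snippet above):

    import onnxruntime as ort

    print(ort.get_device())                # "GPU" if the GPU build is installed
    print(ort.get_available_providers())   # should include "CUDAExecutionProvider"

    # Creating a session surfaces the same warning seen in the logs if the CUDA/cuDNN
    # libraries are missing or incompatible; get_providers() shows what was actually used.
    sess = ort.InferenceSession(
        "./app/artifacts/OL_1/encoder_model_optimized.onnx",
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    print(sess.get_providers())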
Top GitHub Comments
Hi @soocheolnoh,
The fix has been done; there was a bug in the output population, thanks for pointing it out. You should now get the same translation result with or without IOBinding.
Also sharing some performance numbers tested with the previous snippet (PyTorch vs. Optimum, T4, warm_up_steps=10, loop=100, num_beam=5, max_length=256).
The issue is closed, but feel free to reopen it or ping me if you have extra questions about IOBinding. @soocheolnoh @Matthieu-Tinycoaching, thanks again for helping us improve Optimum.
Hi @soocheolnoh, thanks for testing.
From my side, for the mt5 model the generated results differ with vs. without IO binding, which is not normal: IO binding is not supposed to change the results, only where the data is placed. It might be a bug in the post-processing of the outputs. I will take a closer look and fix the beam search ASAP.
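For anyone who wants to reproduce such a comparison, here is a minimal sketch that generates with and without IO binding. It assumes the use_io_binding argument of ORTModelForSeq2SeqLM.from_pretrained (available in recent Optimum releases); the model id and input sentence are placeholders, and tensor placement may need adjusting depending on the Optimum version:

    from transformers import AutoTokenizer
    from optimum.onnxruntime import ORTModelForSeq2SeqLM

    model_id = "Helsinki-NLP/opus-mt-fr-en"  # placeholder; the comment above refers to an mt5 model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("Bonjour, comment allez-vous ?", return_tensors="pt").to("cuda")

    for use_io_binding in (True, False):
        model = ORTModelForSeq2SeqLM.from_pretrained(
            model_id,
            from_transformers=True,            # export the checkpoint to ONNX on the fly
            provider="CUDAExecutionProvider",
            use_io_binding=use_io_binding,
        )
        output_ids = model.generate(**inputs, num_beams=5, max_length=256)
        # Once the fix is released, both iterations should print the same translation.
        print(use_io_binding, tokenizer.decode(output_ids[0], skip_special_tokens=True))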