Inference performance drops 22X on GPU hardware with optimum[onnxruntime-gpu] (compared with transformers)
System Info
pip list:
onnx 1.12.0
onnxruntime-gpu 1.12.1
optimum 1.3.0
...
transformers 4.21.2
docker image: nvcr.io/nvidia/tensorrt:22.07-py3
GPU type: Tesla V100-SXM2-16GB
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
- create docker container
docker run --gpus all -it --name=tensorrtx -d --rm -v /home/ytseng/models:/models nvcr.io/nvidia/tensorrt:22.07-py3
- install the required Python packages (a quick CUDA-provider sanity check follows the command)
ln -s /usr/local/bin/pip /usr/bin/pip && pip install optimum[onnxruntime-gpu] transformers
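Before running the scripts below, a quick sanity check (not part of the original report) can confirm that onnxruntime-gpu actually exposes the CUDA execution provider inside the container:

```python
# sanity check (not in the original repro): make sure onnxruntime-gpu
# can see the CUDA execution provider inside the container
import onnxruntime as ort

providers = ort.get_available_providers()
print(providers)
assert "CUDAExecutionProvider" in providers, "onnxruntime-gpu does not expose CUDA"
```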
- save the Hugging Face model and export it to ONNX (an optional check on the exported files follows the script)
cmd: python save_and_export_model.py
```python
# save_and_export_model.py
model_checkpoint = "facebook/m2m100_418M"  # same checkpoint as in run_translation.py
transformer_model = "/models/optimum_m2m100/m2m100_transformer"
onnx_model = "/models/optimum_m2m100/m2m100_onnx"


def download_transformer_model():
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Load tokenizer and PyTorch weights from the Hub
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    pt_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
    # Save to disk
    tokenizer.save_pretrained(transformer_model)
    pt_model.save_pretrained(transformer_model)


def export_to_onnx_model():
    """Export the transformers model to ONNX."""
    from optimum.onnxruntime import ORTModelForSeq2SeqLM

    # Load a model from transformers and export it through the ONNX format
    model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)
    # Save the ONNX model
    model.save_pretrained(onnx_model)


if __name__ == '__main__':
    download_transformer_model()
    export_to_onnx_model()
```
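As an optional follow-up (not in the original report), one of the exported ONNX files can be opened directly with onnxruntime to confirm that the CUDA provider is actually selected; the encoder_model.onnx file name below is an assumption about what optimum writes into the output directory:

```python
# optional check (assumes optimum wrote an encoder_model.onnx into the output dir;
# adjust the file name to whatever files actually exist there)
import os
import onnxruntime as ort

onnx_model = "/models/optimum_m2m100/m2m100_onnx"
session = ort.InferenceSession(
    os.path.join(onnx_model, "encoder_model.onnx"),
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# if CUDA is usable, CUDAExecutionProvider should be listed first
print(session.get_providers())
```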
- run translation with the transformers API and the optimum API (a note on more precise GPU timing follows the script)
testing cmd: python run_translation.py trans ; echo '---------------' ; python run_translation.py opt
```python
# -*- coding: utf-8 -*-
# run_translation.py
import time, logging
import torch, os
import sys
from transformers import AutoTokenizer

logging.basicConfig(level=logging.INFO)

model_checkpoint = "facebook/m2m100_418M"
transformer_model = "/models/optimum_m2m100/m2m100_transformer"
onnx_model = "/models/optimum_m2m100/m2m100_onnx"
loop = 5

if len(sys.argv) != 2:
    logging.error("call with python run_translation.py <backend>")
    exit(1)

chinese_text = "机器学习是人工智能的一个分支。人工智能的研究历史有着一条从以“推理”为重点,到以“知识”为重点,再到以“学习”为重点的自然、清晰的脉络。显然,机器学习是实现人工智能的一个途径,即以机器学习为手段解决人工智能中的问题。机器学习在近30多年已发展为一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析(英语:Convex analysis)、计算复杂性理论等多门学科。机器学习理论主要是设计和分析一些让计算机可以自动“学习”的算法。机器学习算法是一类从数据中自动分析获得规律,并利用规律对未知数据进行预测的算法。因为学习算法中涉及了大量的统计学理论,机器学习与推断统计学联系尤为密切,也被称为统计学习理论。算法设计方面,机器学习理论关注可以实现的,行之有效的学习算法。很多推论问题属于无程序可循难度,所以部分的机器学习研究是开发容易处理的近似算法。机器学习已广泛应用于数据挖掘、计算机视觉、自然语言处理、生物特征识别、搜索引擎、医学诊断、检测信用卡欺诈、证券市场分析、DNA序列测序、语音和手写识别、战略游戏和机器人等领域"
logging.info(f"chinese_text is {chinese_text}")
logging.info(f"chinese_text length is {len(chinese_text)}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
logging.info(f"This test will use device: {device}")


def get_transformer_model():
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
    model = M2M100ForConditionalGeneration.from_pretrained(transformer_model).to(device)
    tokenizer = M2M100Tokenizer.from_pretrained(transformer_model)
    return (model, tokenizer)


def get_optimum_onnx_model():
    from optimum.onnxruntime import ORTModelForSeq2SeqLM
    model = ORTModelForSeq2SeqLM.from_pretrained(onnx_model).to(device)
    tokenizer = AutoTokenizer.from_pretrained(transformer_model)
    return (model, tokenizer)


def translate():
    if str(sys.argv[1]) == "trans":
        logging.info("Transformer backend is enabled in this test")
        (model, tokenizer) = get_transformer_model()
    elif str(sys.argv[1]) == "opt":
        logging.info("Optimum backend is enabled in this test")
        (model, tokenizer) = get_optimum_onnx_model()
    else:
        logging.info("backend type is illegal, should be 'opt' or 'trans'")
        exit(1)

    start = time.time()
    for i in range(loop):
        encoded_zh = tokenizer(chinese_text, return_tensors="pt").to(device)
        generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
        result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        logging.info(f"#{i}: {result}")
    end = time.time()

    total_time = end - start
    logging.info(f"total: {total_time}")
    logging.info(f"loop: {loop}")
    logging.info(f"avg(s): {total_time / loop}")
    logging.info(f"throughput(translation/s): {loop / total_time}")


if __name__ == '__main__':
    translate()
```
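A side note on the timing methodology: both backends launch asynchronous CUDA work, so a slightly more precise way to measure generate() alone is to warm up once and synchronize the device around the timed region. This is only a minimal sketch, assuming the model, tokenizer, device, and chinese_text objects built by run_translation.py:

```python
# minimal timing sketch (assumes `model`, `tokenizer`, `device`, `chinese_text`
# from run_translation.py); isolates generate() from tokenization/decoding and
# synchronizes CUDA so asynchronous kernel launches are fully counted
import time
import torch

def time_generate(model, tokenizer, text, device, n=5):
    encoded = tokenizer(text, return_tensors="pt").to(device)
    # warm-up so one-time initialization is not measured
    model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("en"))
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n):
        model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("en"))
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n
```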
Expected behavior
We expected the performance of the transformers backend and the optimum[onnxruntime-gpu] backend to be close. Instead, optimum turns out to be roughly 22X slower than the transformers backend (a quick ratio check from the logged averages follows the log below). We're not sure whether this is related to the recently reported bug #362, since that report also involves a seq2seq transformer model on GPU hardware.
Here is the output log from the translation step of the Reproduction section above.
root@ec44f06bb0bf:/workspace# python run_translation.py trans ; echo '---------------' ; python run_translation.py opt
INFO:root:chinese_text is ...
INFO:root:chinese_text length is 460
INFO:root:This test will use device: cuda
INFO:root:Transformer backend is enabled in this test
/usr/local/lib/python3.8/dist-packages/transformers/generation_utils.py:1202: UserWarning: Neither `max_length` nor `max_new_tokens` have been set, `max_length` will default to 200 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
INFO:root:#0: ...
INFO:root:total: 13.273212909698486
INFO:root:loop: 5
INFO:root:avg(s): 2.6546425819396973
INFO:root:throughput(translation/s): 0.3766985457113096
---------------
INFO:root:chinese_text is ...
INFO:root:chinese_text length is 460
INFO:root:This test will use device: cuda
INFO:root:Optimum backend is enabled in this test
/usr/local/lib/python3.8/dist-packages/transformers/generation_utils.py:1202: UserWarning: Neither `max_length` nor `max_new_tokens` have been set, `max_length` will default to 200 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
INFO:root:#0: ...
INFO:root:total: 303.64781045913696
INFO:root:loop: 5
INFO:root:avg(s): 60.72956209182739
INFO:root:throughput(translation/s): 0.016466445097824505
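For reference, the 22X figure follows directly from the two logged per-translation averages:

```python
# slowdown computed from the avg(s) values logged above
pt_avg = 2.6546425819396973    # transformers backend
ort_avg = 60.72956209182739    # optimum[onnxruntime-gpu] backend
print(f"slowdown: {ort_avg / pt_avg:.1f}x")  # ~22.9x
```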
Top GitHub Comments
Also, we did some Nsight GPU profiling of these runs and got some strange findings. We hope these will help.
- no.1: optimum spent a lot of time on data copies. [Screenshots: optimum GPU memops summary (time); transformer GPU memops summary (time)]
- no.2: optimum copied a large amount of data. [Screenshots: optimum GPU memops summary (size); transformer GPU memops summary (size)]
- no.3: for both backends, the kernels themselves did not take much time compared with the host-device data copies (see the generic IO-binding sketch below). [Screenshots: optimum/transformer top GPU kernel costs; CUDA summary (API/Kernels/Memops)]
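The copy-heavy profile above is the kind of pattern that onnxruntime's IO binding API is meant to avoid by keeping inputs and outputs in device memory. The following is only a generic sketch against the exported encoder, not optimum's internal code and not a confirmed fix for this issue; the file name and tensor names are assumptions:

```python
# generic onnxruntime IO-binding sketch (illustrative only, not optimum's code);
# binding inputs/outputs to CUDA memory avoids the repeated host<->device copies
# that the Nsight memops summaries point at
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "/models/optimum_m2m100/m2m100_onnx/encoder_model.onnx",  # assumed file name
    providers=["CUDAExecutionProvider"],
)

io = session.io_binding()
# dummy inputs for illustration; "input_ids"/"attention_mask" names are assumptions
input_ids = np.ones((1, 16), dtype=np.int64)
attention_mask = np.ones((1, 16), dtype=np.int64)
io.bind_cpu_input("input_ids", input_ids)            # copied to the device once
io.bind_cpu_input("attention_mask", attention_mask)
io.bind_output("last_hidden_state", device_type="cuda")  # output stays on the GPU

session.run_with_iobinding(io)
# copy back to the host only when the result is actually needed on the CPU
outputs = io.copy_outputs_to_cpu()
```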
Thanks, can reproduce and will fix the bug shortly. Our nightly workflows for GPU appear to have been broken, will try to fix as well.