
Speed difference ONNX vs TensorRT with samples sorted by sequence length


I noticed something unexpected when comparing two scenarios for a model (DistilRoBERTa with a classification head) converted to both ONNX and TensorRT:

  1. I use a dataset with varying sentence lengths (~20-60 tokens) and run it through both models in random order
  2. I use the same dataset but sort the sentences by length (decreasing) before running it through both models

Result: The TensorRT model does not seem to care about the sequence lengths and keeps the same speed for both scenarios. The ONNX model, however, gets almost twice as fast when I use the second scenario.

I was wondering whether TensorRT's optimization somehow pads to the max length internally. I searched for a parameter or a reason for this behavior but couldn't find anything useful. For conversion, I set the seq-len parameter to 1 60 60.

I was wondering whether someone else has already observed this and knows the reason or a solution.
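
For reference, my understanding is that the three seq-len values become the min/opt/max sequence lengths of a TensorRT dynamic-shape optimization profile at conversion time. Below is a minimal sketch of how I picture that profile being built (the input names, batch dimensions and exact API calls are my assumptions, not the actual transformer-deploy code):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("model.onnx", "rb") as f:
    assert parser.parse(f.read()), "ONNX parsing failed"

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# dynamic (batch, sequence) shapes: min / opt / max sequence length = 1 / 60 / 60
for name in ("input_ids", "attention_mask"):
    profile.set_shape(name, min=(1, 1), opt=(1, 60), max=(32, 60))
config.add_optimization_profile(profile)
serialized_engine = builder.build_serialized_network(network, config)  # TensorRT 8 API

Since opt and max are both 60, I suspected the engine might be tuned for (or effectively padded to) length 60, which would explain why sorting makes no difference for it - but I could not confirm that.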

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
v1nc3nt27 commented, Mar 8, 2022

Hey @pommedeterresautee, sorry for the long wait - I was on a holiday trip.

I based my script on your demo scripts, but I cannot disclose the model or dataset. You can basically use any dataset with two inputs, e.g. a QA example. I hope you can make use of it anyway.

I attached the script that calls the inference ensemble hosted in Triton (transformer_onnx_inference or transformer_trt_inference) and the slightly modified model.py for the tokenize endpoint in Triton.

If you experience the same behavior I do, calling the ONNX model's inference endpoint should get slower when you comment out the length sorting in triton_inference_qa_test.py, while doing the same for the TensorRT model's endpoint should make no difference.

triton_inference_qa_test.py

import argparse
import math
import time
import numpy as np
import tritonclient.http
from tqdm import tqdm
from scipy.special import softmax
from transformer_deploy.benchmarks.utils import print_timings, setup_logging, track_infer_time


def _batch(iterable, n=1):
    """Yield successive chunks of size n from iterable."""
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]


def run_http_sync():
    """Send the dataset to the Triton ensemble in batches and collect gold labels and predictions."""
    all_gold = []
    all_pred = []
    triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)

    assert triton_client.is_model_ready(
        model_name=model_name, model_version=model_version
    ), f"model {model_name} not yet ready"
    setup_logging()

    model_score = tritonclient.http.InferRequestedOutput(name="output", binary_data=True)
    for b in tqdm(_batch(list(zip(questions, answers, golds)), batch_size), total=math.ceil(len(answers)/batch_size)):
        with track_infer_time(time_buffer):
            topic_b, sent_b, gold_b = zip(*b) 
            all_gold.extend(gold_b)

            query_sent = tritonclient.http.InferInput(name="sent", shape=(len(b),), datatype="BYTES")
            query_topic = tritonclient.http.InferInput(name="topic", shape=(len(b),), datatype="BYTES")

            query_sent.set_data_from_numpy(np.asarray(sent_b, dtype=object))
            query_topic.set_data_from_numpy(np.asarray(topic_b, dtype=object))

            response = triton_client.infer(
                model_name=model_name, model_version=model_version, inputs=[query_topic, query_sent], outputs=[model_score],
                response_compression_algorithm="gzip", request_compression_algorithm="gzip"
            )
            res = response.as_numpy("output")
            scores = softmax(res, axis=1)
            all_pred.extend([np.argmax(pred) for pred in scores])

    return all_gold, all_pred


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="require inference", formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("--triton-model-name", help="Model name in triton server", type=str)
    parser.add_argument("--url", help="", type=str, default="127.0.0.1:8000")
    parser.add_argument("--batch-size", help="", type=int, default=64)
    args, _ = parser.parse_known_args()

    setup_logging()
    model_name = args.triton_model_name
    url = args.url
    model_version = "1"
    batch_size = args.batch_size

    # todo read data
    data = read_data(...)

    answers = data["answers"].tolist()
    golds = data["label"].tolist()
    questions = data["questions"].tolist()
    t = time.time()

    # comment out the following 4 lines to switch off bucketing
    length_sorted_idx = np.argsort([len(q+a) for q, a in zip(questions, answers)])
    answers = [answers[idx] for idx in length_sorted_idx]
    golds = [golds[idx] for idx in length_sorted_idx]
    questions = [questions[idx] for idx in length_sorted_idx]
    time_buffer = list()

    all_gold, all_pred = run_http_sync()

    print_timings(name="triton transformers", timings=time_buffer)
    total_time = time.time() - t
    print("Total time: " + str(total_time))

model.py

import os
from typing import Dict, List

import numpy as np


try:
    # noinspection PyUnresolvedReferences
    import triton_python_backend_utils as pb_utils
except ImportError:
    pass  # triton_python_backend_utils exists only inside Triton Python backend.

from transformers import AutoTokenizer, PreTrainedTokenizer, TensorType


class TritonPythonModel:
    is_tensorrt: bool
    tokenizer: PreTrainedTokenizer

    def initialize(self, args: Dict[str, str]) -> None:
        """
        Initialize the tokenization process
        :param args: arguments from Triton config file
        """
        path: str = os.path.join(args["model_repository"], args["model_version"])
        model_name: str = args["model_name"]
        self.is_tensorrt = "trt" in model_name or "tensorrt" in model_name
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.tokenizer.model_max_length = 60

    def execute(self, requests) -> "List[List[pb_utils.Tensor]]":
        """
        Parse and tokenize each request
        :param requests: 1 or more requests received by Triton server.
        :return: text as input tensors
        """
        responses = []
        # for loop for batch requests (disabled in our case)
        for request in requests:
            # binary data typed back to string
            query_topic = [t.decode("UTF-8") for t in pb_utils.get_input_tensor_by_name(request, "topic").as_numpy().tolist()]
            query_sent = [t.decode("UTF-8") for t in pb_utils.get_input_tensor_by_name(request, "sent").as_numpy().tolist()]

            tokens: Dict[str, np.ndarray] = self.tokenizer(query_topic, query_sent, return_tensors=TensorType.NUMPY,
                                                           padding="longest", truncation=True)
            if self.is_tensorrt:
                # tensorrt uses int32 as input type, ort uses int64
                tokens = {k: v.astype(np.int32) for k, v in tokens.items()}
            # communicate the tokenization results to Triton server
            outputs = list()
            for input_name in self.tokenizer.model_input_names:
                tensor_input = pb_utils.Tensor(input_name, tokens[input_name])
                outputs.append(tensor_input)

            inference_response = pb_utils.InferenceResponse(output_tensors=outputs)
            responses.append(inference_response)

        return responses
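
As a side note, padding="longest" in model.py pads each batch only to its longest member, so sorting the inputs by length reduces the number of padded tokens the models have to process. Here is a quick standalone illustration with made-up lengths (not my actual data):

import random

random.seed(0)
# hypothetical token lengths in the 20-60 range, similar to the dataset described above
lengths = [random.randint(20, 60) for _ in range(4096)]
batch_size = 64


def padded_tokens(lens, bs):
    # total tokens processed when every batch is padded to its longest member
    return sum(max(lens[i:i + bs]) * len(lens[i:i + bs]) for i in range(0, len(lens), bs))


print("shuffled:", padded_tokens(lengths, batch_size))
print("sorted:  ", padded_tokens(sorted(lengths, reverse=True), batch_size))

With these numbers the sorted ordering pads roughly a third fewer tokens, which is consistent with the ONNX speedup - it just doesn't explain why the TensorRT endpoint stays flat.
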
0 reactions
pommedeterresautee commented, Mar 3, 2022

Can you provide me with some reproducible code so I can test it on my side?
