
Slow ONNX inference

See original GitHub issue

Hi, I am seeing slower inference times for an ONNX model on Triton 2.0.0 than when running the same model with an ONNX Runtime 1.3.0 InferenceSession. The Triton inference times seem very similar to the inference times I see when I force the InferenceSession to run on CPU.

Here are sample inference times I am seeing on Triton (same image 5 times):

TITAN Xp: 4.24s, 3.49s, 3.41s, 3.42s, 3.41s

Here are sample inference times I am seeing on a Colab notebook with various GPUs (same image 5 times):

Tesla P100   CUDA: 0.48s, 0.07s, 0.06s, 0.07s, 0.07s
             CPU:  2.95s, 2.86s, 2.85s, 2.84s, 2.92s

Tesla T4     CUDA: 0.69s, 0.13s, 0.12s, 0.12s, 0.12s
             CPU:  2.58s, 2.61s, 2.60s, 2.66s, 2.64s

Tesla K80    CUDA: 1.29s, 0.26s, 0.24s, 0.22s, 0.21s
             CPU:  3.35s, 3.42s, 3.36s, 3.37s, 3.38s

I am measuring the inference times in my Python client for Triton and in Colab by taking the datetime difference before and after each request:

Triton

import datetime

# triton_client, inputs, outputs, FLAGS, and sent_count are set up earlier in the client script.
start_time = datetime.datetime.now()
response = triton_client.infer(FLAGS.model_name,
    inputs,
    request_id=str(sent_count),
    model_version=FLAGS.model_version,
    outputs=outputs)
end_time = datetime.datetime.now()
print(f"triton inference time: {end_time - start_time}")

Colab

import datetime

# model_session is an onnxruntime.InferenceSession; np_img is the preprocessed input image.
start_time = datetime.datetime.now()
onnx_detections = model_session.run(None, {input_name: np_img})
end_time = datetime.datetime.now()
print(f"onnx inference time: {end_time - start_time}")

I am using the same image for testing inference on Triton and on Colab.
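
One caveat with these numbers: datetime differences taken around triton_client.infer also include request serialization and network overhead, and the first run on each device pays one-time initialization cost (visible above, where run 1 on CUDA is several times slower than runs 2-5). A minimal sketch of a steadier local measurement with an explicit warm-up, assuming the model file is named model.onnx and using a random tensor as a stand-in for the real image:

import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")                  # placeholder path
input_name = session.get_inputs()[0].name
np_img = np.random.rand(1, 3, 640, 640).astype(np.float32)    # stand-in for the test image

# Warm-up run: absorbs one-time CUDA context / kernel initialization cost.
session.run(None, {input_name: np_img})

timings = []
for _ in range(5):
    t0 = time.perf_counter()
    session.run(None, {input_name: np_img})
    timings.append(time.perf_counter() - t0)

print(", ".join(f"{t:.2f}s" for t in timings))

The same warm-up-then-measure pattern applies on the Triton side: discard the first request before comparing averages.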

Here is my config.pbtxt:

name: "yolov5"
platform: "onnxruntime_onnx"
version_policy {
    specific: {
        versions: [ 1 ]
    }
}
input [
    {
        name: "images"
        data_type: TYPE_FP32
        dims: [ 1, 3, 640, 640 ]
    }
]
output [
    {
        name: "output"
        data_type: TYPE_FP32
        dims: [ 1, 25200, 85 ]
    },
    {
        name: "751"
        data_type: TYPE_FP32
        dims: [ 1, 3, 80, 80, 85 ]
    },
    {
        name: "1034"
        data_type: TYPE_FP32
        dims: [ 1, 3, 40, 40, 85 ]
    },
    {
        name: "1317"
        data_type: TYPE_FP32
        dims: [ 1, 3, 20, 20, 85 ]
    }
]
instance_group [
    {
        kind: KIND_GPU
        count: 1
        gpus: [ 0 ]
    }
]
default_model_filename: "model.onnx"
cc_model_filenames [
    {
        key: "6.1"
        value: "model.onnx"
    }
]
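
To double-check that Triton actually loaded the model with this GPU instance group, one option is to ask the running server for the model's configuration and metadata. A quick sketch using the current tritonclient HTTP package (the Python client packaging has changed between releases, so treat the module and method names as assumptions for your version):

import tritonclient.http as httpclient

# Assumes Triton's HTTP endpoint is on the default port of the local host.
client = httpclient.InferenceServerClient(url="localhost:8000")

print(client.is_model_ready("yolov5"))            # True once version 1 is loaded
config = client.get_model_config("yolov5")        # server's view of config.pbtxt, as a dict
print(config.get("instance_group"))               # expect kind KIND_GPU on GPU 0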

I am not sure where this performance difference is coming from and am looking for any insight into whether this is expected or if there is something wrong with my Triton config/setup. Could ORT on Triton be running with the CPU provider?
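
On that last question: if you can run a Python shell against an onnxruntime build, the runtime will report directly which providers it is using. This checks a local onnxruntime install (e.g. the Colab session) rather than the ORT backend bundled into Triton, so treat it as a sanity check only; the model path below is a placeholder:

import onnxruntime as ort

print(ort.get_device())                            # "GPU" if the build was compiled with CUDA

session = ort.InferenceSession("model.onnx")       # placeholder path
print(session.get_providers())                     # e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider']

For the Triton side, watching nvidia-smi while requests are in flight is a simple cross-check: if the model really runs on the TITAN Xp, the Triton server process should show up with non-trivial GPU utilization and memory.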

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
deadeyegoodwin commented, Aug 5, 2020

We don't have a general characterization of the perf difference between the Python and C++ clients, but I would guess that in some cases it could be considerable. Python isn't expected to perform at the level of C/C++.

Triton provides a Python library that uses GRPC; the API is similar to the Python HTTP client, so it should be easy for you to try. Simple example here: https://github.com/NVIDIA/triton-inference-server/blob/master/src/clients/python/examples/simple_grpc_infer_client.py

GRPC also allows you to use the protoc compiler to create a client API for different languages if that is interesting to you. An example for golang is here: https://github.com/NVIDIA/triton-inference-server/tree/master/src/clients/go
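
For reference, a minimal sketch of what that gRPC path can look like with the current tritonclient package, filled in with this issue's model and tensor names; the exact package and module names have shifted across releases, so treat them as assumptions for your version:

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")   # default gRPC port

np_img = np.random.rand(1, 3, 640, 640).astype(np.float32)        # stand-in input image

infer_input = grpcclient.InferInput("images", [1, 3, 640, 640], "FP32")
infer_input.set_data_from_numpy(np_img)

requested_output = grpcclient.InferRequestedOutput("output")

result = client.infer(model_name="yolov5",
                      inputs=[infer_input],
                      outputs=[requested_output])
detections = result.as_numpy("output")                            # shape (1, 25200, 85)
print(detections.shape)

The HTTP and gRPC clients expose largely parallel InferInput/InferRequestedOutput APIs, so switching between them is mostly a matter of the import and the port.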

1 reaction
deadeyegoodwin commented, Aug 4, 2020

We've triaged the slowdown. A big part of it was fixed by #1865 and you should see that fix in the 20.08 release. The remaining slowdown in v2.0 vs. v1.13 comes from a change in how the Python client is implemented. In 1.x the Python client interfaces with a custom C++ library, whereas in v2.0 the Python client is Python only. Going from a custom C++ library backend to a Python-based (gevent) implementation is slower (perhaps that is not surprising). Having a "pure" Python client is/was a common request, so I don't anticipate we will revert. If you really care about performance, use the C++ client (or a gRPC-generated library for your client language).


