Slow ONNX inference
Hi, I am seeing slower inference times for an ONNX model on Triton 2.0.0 than when running the same model with an ONNX Runtime 1.3.0 InferenceSession. The Triton inference times are very close to the times I see when I force the InferenceSession to run on CPU.
Here are sample inference times I am seeing on Triton (same image, 5 consecutive requests):

| Run | TITAN Xp (Triton) |
| --- | --- |
| 1 | 4.24s |
| 2 | 3.49s |
| 3 | 3.41s |
| 4 | 3.42s |
| 5 | 3.41s |
Here are sample inference times I am seeing on a Colab notebook with various GPUs (same image, 5 consecutive requests, CUDA vs. CPU execution providers):

| Run | Tesla P100 CUDA | Tesla P100 CPU | Tesla T4 CUDA | Tesla T4 CPU | Tesla K80 CUDA | Tesla K80 CPU |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.48s | 2.95s | 0.69s | 2.58s | 1.29s | 3.35s |
| 2 | 0.07s | 2.86s | 0.13s | 2.61s | 0.26s | 3.42s |
| 3 | 0.06s | 2.85s | 0.12s | 2.60s | 0.24s | 3.36s |
| 4 | 0.07s | 2.84s | 0.12s | 2.66s | 0.22s | 3.37s |
| 5 | 0.07s | 2.92s | 0.12s | 2.64s | 0.21s | 3.38s |
I am measuring the inference times in my Python Triton client and in the Colab notebook by taking the datetime difference before and after each request:
Triton

```python
start_time = datetime.datetime.now()
response = triton_client.infer(FLAGS.model_name,
                               inputs,
                               request_id=str(sent_count),
                               model_version=FLAGS.model_version,
                               outputs=outputs)
end_time = datetime.datetime.now()
print(f"triton inference time: {end_time - start_time}")
```
Colab

```python
start_time = datetime.datetime.now()
onnx_detections = model_session.run(None, {input_name: np_img})
end_time = datetime.datetime.now()
print(f"onnx inference time: {end_time - start_time}")
```
I am using the same image for testing inference on Triton and on Colab.
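A note on the measurement itself: the first request in each series (4.24s on Triton, 0.48s/0.69s/1.29s on Colab CUDA) likely includes one-time warm-up cost, and `datetime.now()` round-trip timing folds network and serialization overhead into the Triton numbers. Here is a minimal timing sketch I use for sanity checks; the helper name and structure are mine (not from the issue), and it assumes the `triton_client`, `inputs`, and `outputs` objects from the snippet above:

```python
import time

# Hypothetical helper (not from the issue): time repeated requests with
# time.perf_counter(), skipping warm-up calls so one-time initialization
# cost is not mixed into the steady-state numbers.
def time_requests(run_once, warmup=1, repeats=5):
    for _ in range(warmup):
        run_once()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return timings

# Usage with the Triton call above (names assumed from the issue):
# timings = time_requests(lambda: triton_client.infer(FLAGS.model_name,
#                                                     inputs,
#                                                     request_id="0",
#                                                     model_version=FLAGS.model_version,
#                                                     outputs=outputs))
# print([f"{t:.3f}s" for t in timings])
```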
Here is my config.pbtxt:
```protobuf
name: "yolov5"
platform: "onnxruntime_onnx"
version_policy {
  specific: {
    versions: [ 1 ]
  }
}
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 1, 3, 640, 640 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1, 25200, 85 ]
  },
  {
    name: "751"
    data_type: TYPE_FP32
    dims: [ 1, 3, 80, 80, 85 ]
  },
  {
    name: "1034"
    data_type: TYPE_FP32
    dims: [ 1, 3, 40, 40, 85 ]
  },
  {
    name: "1317"
    data_type: TYPE_FP32
    dims: [ 1, 3, 20, 20, 85 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
    count: 1
    gpus: [ 0 ]
  }
]
default_model_filename: "model.onnx"
cc_model_filenames [
  {
    key: "6.1"
    value: "model.onnx"
  }
]
```
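To double-check where the model actually landed, one option is to ask the server for the configuration it loaded and look at the reported `instance_group`. A minimal sketch, assuming Triton's HTTP endpoint is on the default port 8000 and the model-configuration endpoint (`/v2/models/<name>/config`) is available in this release:

```python
import json
import urllib.request

# Query the configuration Triton actually loaded for the model
# (default HTTP port 8000 assumed; adjust host/port as needed).
url = "http://localhost:8000/v2/models/yolov5/config"
with urllib.request.urlopen(url) as resp:
    config = json.loads(resp.read())

# instance_group should report kind KIND_GPU if the model was placed on GPU.
print(json.dumps(config.get("instance_group", []), indent=2))
```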
I am not sure where this performance difference is coming from and am looking for any insight into whether this is expected or if there is something wrong with my Triton config/setup. Could ORT on Triton be running with the CPU provider?
Top GitHub Comments
We don’t have a general characterization of the performance difference between the Python and C++ clients, but I would guess that in some cases it could be considerable. Python isn’t expected to perform on the level of C/C++.
Triton provides a Python library that uses GRPC; the API is similar to what you see in the Python HTTP client, so it should be easy for you to try. Simple example here: https://github.com/NVIDIA/triton-inference-server/blob/master/src/clients/python/examples/simple_grpc_infer_client.py
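For reference, a minimal sketch along the lines of that example, using the model and tensor names from the config above. The module is `tritonclient.grpc` in the current `tritonclient` package; older releases shipped the same API as `tritongrpcclient`, so adjust the import to match your install:

```python
import numpy as np
import tritonclient.grpc as grpcclient  # older packages: import tritongrpcclient

# Connect to Triton's gRPC endpoint (default port 8001 assumed).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Input/output names and shapes follow the config.pbtxt shown above.
np_img = np.zeros((1, 3, 640, 640), dtype=np.float32)
infer_input = grpcclient.InferInput("images", list(np_img.shape), "FP32")
infer_input.set_data_from_numpy(np_img)
requested = [grpcclient.InferRequestedOutput("output")]

response = client.infer("yolov5", [infer_input], outputs=requested)
print(response.as_numpy("output").shape)
```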
GRPC also allows you to use the protoc compiler to create a client API in other languages if that is of interest to you. An example for golang is here: https://github.com/NVIDIA/triton-inference-server/tree/master/src/clients/go
We’ve triaged the slowdown. A big part of it was fixed by #1865, and you should see that fix in the 20.08 release. The remaining slowdown in v2.0 vs. v1.13 comes from a change in how the Python client is implemented. In 1.x the Python client interfaces with a custom C++ library, whereas in v2.0 the Python client is Python only. Going from a custom C++ library backend to a Python-based (gevent) implementation is slower (perhaps that is not surprising). Having a “pure” Python client is/was a common request, so I don’t anticipate we will revert. If you really care about performance you should use the C++ client (or a grpc-generated library for your client language).
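One way to see how much of the remaining latency is client overhead versus server compute is to compare the client-measured round-trip time with the per-model statistics Triton can report through the Python client's statistics API (availability depends on the server release, so treat this as a sketch):

```python
import tritonclient.grpc as grpcclient

# Pull Triton's per-model statistics (cumulative queue and compute durations)
# and compare them against the end-to-end latency measured in the client.
client = grpcclient.InferenceServerClient(url="localhost:8001")
stats = client.get_inference_statistics(model_name="yolov5")
print(stats)
```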