Slow ONNX inference
Hi, I am seeing slower inference times for an ONNX model on Triton 2.0.0 than when running the same model with an ONNX Runtime 1.3.0 InferenceSession. The Triton inference times are very close to the times I see when I force the InferenceSession to run on CPU.
Here are sample inference times I am seeing on Triton (same image, 5 consecutive requests):

| Run | TITAN Xp (Triton) |
| --- | --- |
| 1 | 4.24s |
| 2 | 3.49s |
| 3 | 3.41s |
| 4 | 3.42s |
| 5 | 3.41s |
Here are sample inference times I am seeing on a Colab notebook with various GPUs (same image, 5 consecutive requests, CUDA vs. CPU execution providers):

| Run | Tesla P100 CUDA | Tesla P100 CPU | Tesla T4 CUDA | Tesla T4 CPU | Tesla K80 CUDA | Tesla K80 CPU |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.48s | 2.95s | 0.69s | 2.58s | 1.29s | 3.35s |
| 2 | 0.07s | 2.86s | 0.13s | 2.61s | 0.26s | 3.42s |
| 3 | 0.06s | 2.85s | 0.12s | 2.60s | 0.24s | 3.36s |
| 4 | 0.07s | 2.84s | 0.12s | 2.66s | 0.22s | 3.37s |
| 5 | 0.07s | 2.92s | 0.12s | 2.64s | 0.21s | 3.38s |
I am measuring the inference times in my Python Triton client and in the Colab notebook by taking the datetime difference before and after each request:
Triton

```python
start_time = datetime.datetime.now()
response = triton_client.infer(FLAGS.model_name,
                               inputs,
                               request_id=str(sent_count),
                               model_version=FLAGS.model_version,
                               outputs=outputs)
end_time = datetime.datetime.now()
print(f"triton inference time: {end_time - start_time}")
```
Colab

```python
start_time = datetime.datetime.now()
onnx_detections = model_session.run(None, {input_name: np_img})
end_time = datetime.datetime.now()
print(f"onnx inference time: {end_time - start_time}")
```
I am using the same image for testing inference on Triton and on Colab.
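A note on the measurement itself: the first request in each series (4.24s on Triton, 0.48s/0.69s/1.29s on Colab CUDA) likely includes one-time warm-up cost, and `datetime.now()` round-trip timing folds network and serialization overhead into the Triton numbers. Here is a minimal timing sketch I use for sanity checks; the helper name and structure are mine (not from the issue), and it assumes the `triton_client`, `inputs`, and `outputs` objects from the snippet above:

```python
import time

# Hypothetical helper (not from the issue): time repeated requests with
# time.perf_counter(), skipping warm-up calls so one-time initialization
# cost is not mixed into the steady-state numbers.
def time_requests(run_once, warmup=1, repeats=5):
    for _ in range(warmup):
        run_once()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return timings

# Usage with the Triton call above (names assumed from the issue):
# timings = time_requests(lambda: triton_client.infer(FLAGS.model_name,
#                                                     inputs,
#                                                     request_id="0",
#                                                     model_version=FLAGS.model_version,
#                                                     outputs=outputs))
# print([f"{t:.3f}s" for t in timings])
```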
Here is my config.pbtxt:
```protobuf
name: "yolov5"
platform: "onnxruntime_onnx"
version_policy {
  specific: {
    versions: [ 1 ]
  }
}
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 1, 3, 640, 640 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1, 25200, 85 ]
  },
  {
    name: "751"
    data_type: TYPE_FP32
    dims: [ 1, 3, 80, 80, 85 ]
  },
  {
    name: "1034"
    data_type: TYPE_FP32
    dims: [ 1, 3, 40, 40, 85 ]
  },
  {
    name: "1317"
    data_type: TYPE_FP32
    dims: [ 1, 3, 20, 20, 85 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
    count: 1
    gpus: [ 0 ]
  }
]
default_model_filename: "model.onnx"
cc_model_filenames [
  {
    key: "6.1"
    value: "model.onnx"
  }
]
```
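To double-check where the model actually landed, one option is to ask the server for the configuration it loaded and look at the reported `instance_group`. A minimal sketch, assuming Triton's HTTP endpoint is on the default port 8000 and the model-configuration endpoint (`/v2/models/<name>/config`) is available in this release:

```python
import json
import urllib.request

# Query the configuration Triton actually loaded for the model
# (default HTTP port 8000 assumed; adjust host/port as needed).
url = "http://localhost:8000/v2/models/yolov5/config"
with urllib.request.urlopen(url) as resp:
    config = json.loads(resp.read())

# instance_group should report kind KIND_GPU if the model was placed on GPU.
print(json.dumps(config.get("instance_group", []), indent=2))
```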
I am not sure where this performance difference is coming from and am looking for any insight into whether this is expected or if there is something wrong with my Triton config/setup. Could ORT on Triton be running with the CPU provider?
Top GitHub Comments
We don’t have a general characterization of the performance difference between the Python and C++ clients, but I would guess that in some cases it could be considerable. Python isn’t expected to perform on the level of C/C++.
Triton provides a Python library that uses GRPC; the API is similar to what you see in the Python HTTP client, so it should be easy for you to try. Simple example here: https://github.com/NVIDIA/triton-inference-server/blob/master/src/clients/python/examples/simple_grpc_infer_client.py
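For reference, a minimal sketch along the lines of that example, using the model and tensor names from the config above. The module is `tritonclient.grpc` in the current `tritonclient` package; older releases shipped the same API as `tritongrpcclient`, so adjust the import to match your install:

```python
import numpy as np
import tritonclient.grpc as grpcclient  # older packages: import tritongrpcclient

# Connect to Triton's gRPC endpoint (default port 8001 assumed).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Input/output names and shapes follow the config.pbtxt shown above.
np_img = np.zeros((1, 3, 640, 640), dtype=np.float32)
infer_input = grpcclient.InferInput("images", list(np_img.shape), "FP32")
infer_input.set_data_from_numpy(np_img)
requested = [grpcclient.InferRequestedOutput("output")]

response = client.infer("yolov5", [infer_input], outputs=requested)
print(response.as_numpy("output").shape)
```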
GRPC also allows you to use the protoc compiler to create a client API in other languages if that is of interest to you. An example for golang is here: https://github.com/NVIDIA/triton-inference-server/tree/master/src/clients/go
We’ve triaged the slowdown. A big part of it was fixed by #1865, and you should see that fix in the 20.08 release. The remaining slowdown in v2.0 vs. v1.13 comes from a change in how the Python client is implemented. In 1.x the Python client interfaces with a custom C++ library, whereas in v2.0 the Python client is Python only. Going from a custom C++ library backend to a Python-based (gevent) implementation is slower (perhaps that is not surprising). Having a “pure” Python client is/was a common request, so I don’t anticipate we will revert. If you really care about performance you should use the C++ client (or a grpc-generated library for your client language).
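One way to see how much of the remaining latency is client overhead versus server compute is to compare the client-measured round-trip time with the per-model statistics Triton can report through the Python client's statistics API (availability depends on the server release, so treat this as a sketch):

```python
import tritonclient.grpc as grpcclient

# Pull Triton's per-model statistics (cumulative queue and compute durations)
# and compare them against the end-to-end latency measured in the client.
client = grpcclient.InferenceServerClient(url="localhost:8001")
stats = client.get_inference_statistics(model_name="yolov5")
print(stats)
```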