openvino_backend has much lower performance than tensorflow_backend
Description
I have a recommendation model that loads normally with the TensorFlow backend, but when I load it with the OpenVINO (OV) backend, Triton reports that an operator is not supported. I therefore implemented the unsupported operator and packaged the custom op into a shared library; now the OV backend can also load the model. However, in performance testing the OV backend turned out to be much slower than TensorFlow. The results are as follows:

tensorflow_backend:
root@test:/data/xxxx# ./perf_client -a -b 600 -u localhost:8001 -i gRPC -m lat_tf --concurrency-range 1
*** Measurement Settings ***
Batch size: 600
Measurement window: 5000 msec
Using asynchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 157
Throughput: 18840 infer/sec
Avg latency: 31705 usec (standard deviation 2264 usec)
p50 latency: 31287 usec
p90 latency: 34560 usec
p95 latency: 36016 usec
p99 latency: 37818 usec
Avg gRPC time: 31804 usec ((un)marshal request/response 2205 usec + response wait 29599 usec)
Server:
Inference count: 113400
Execution count: 189
Successful request count: 189
Avg request latency: 27745 usec (overhead 218 usec + queue 180 usec + compute input 647 usec + compute infer 26637 usec + compute output 63 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 18840 infer/sec, latency 31705 usec
openvino_backend:
root@test:/data/xxxxx# ./perf_client -a -b 600 -u localhost:8001 -i gRPC -m lat_openvino --concurrency-range 1
*** Measurement Settings ***
Batch size: 600
Measurement window: 5000 msec
Using asynchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 35
Throughput: 4200 infer/sec
Avg latency: 142570 usec (standard deviation 11775 usec)
p50 latency: 141110 usec
p90 latency: 157490 usec
p95 latency: 158868 usec
p99 latency: 171043 usec
Avg gRPC time: 142093 usec ((un)marshal request/response 992 usec + response wait 141101 usec)
Server:
Inference count: 25200
Execution count: 42
Successful request count: 42
Avg request latency: 140174 usec (overhead 138 usec + queue 35 usec + compute input 683 usec + compute infer 138491 usec + compute output 827 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 4200 infer/sec, latency 142570 usec
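Since the server-side breakdown attributes almost all of the extra latency to compute infer (138491 usec vs 26637 usec), it may also be worth checking whether the gap changes under load, for example by sweeping the client concurrency with the same tool. This is a suggested follow-up measurement, not one from the runs above:

./perf_client -a -b 600 -u localhost:8001 -i gRPC -m lat_openvino --concurrency-range 1:4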
Triton Information
What version of Triton are you using? 21.06
Are you using the Triton container or did you build it yourself? I am using the Triton container, but I recompiled the OpenVINO backend myself from openvino_backend r21.06 against OpenVINO 2021.3. This is my command:
cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_BUILD_OPENVINO_VERSION=2021.3.394 \
      -DTRITON_BUILD_CONTAINER_VERSION=21.06 ..
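For completeness, a full out-of-source build following the openvino_backend README would look roughly like the sketch below; the checkout location and the r21.06 branch tag are assumed here, not taken from the report:

git clone -b r21.06 https://github.com/triton-inference-server/openvino_backend.git
cd openvino_backend && mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install \
      -DTRITON_BUILD_OPENVINO_VERSION=2021.3.394 \
      -DTRITON_BUILD_CONTAINER_VERSION=21.06 ..
make install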
To Reproduce
My code: https://drive.google.com/file/d/1WObxsMidRSEnkSr97P2bHUytX5UKoym_/view?usp=sharing. It includes the model repository (OV and TF versions) and the custom-op shared library. Load the repository with Triton, place the shared library somewhere on the server, and update the shared-library path in the OV model's configuration file accordingly.
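The OV model's config.pbtxt is where that shared-library path is set. A minimal sketch is shown below; the parameter key CPU_EXTENSION_PATH, the paths, and the omitted input/output sections are illustrative placeholders rather than the exact contents of my config:

name: "lat_openvino"
backend: "openvino"
max_batch_size: 600
# input/output tensor definitions omitted in this sketch
parameters: {
  key: "CPU_EXTENSION_PATH"  # hypothetical key; use the key your backend build actually reads
  value: { string_value: "/path/to/libcustom_op.so" }
}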
Expected behavior
I expected the OV backend to perform better than the TF backend. Who can help me? Thank you very much.
Top GitHub Comments
Yes, the model.xml and model.bin in my repo were created from the TF SavedModel.
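For reference, converting a TF SavedModel to OpenVINO IR with the 2021.x Model Optimizer looks roughly like the following; the directories are placeholders and the exact flags depend on the model:

python3 mo.py --saved_model_dir ./saved_model --output_dir ./model_repository/lat_openvino/1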
We don't officially support OpenVINO 2021.4, and the onnxruntime_backend is still at 2021.2. But you can use our build.py script to build OpenVINO backends for multiple versions: https://github.com/triton-inference-server/server/blob/main/build.py#L62-L83
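For example, a build of just the OpenVINO backend via build.py might look roughly like this; the build directory is a placeholder, additional flags may be required, and the OpenVINO version that gets built is controlled by the TRITON_VERSION_MAP in the lines linked above:

python3 build.py --build-dir=/tmp/tritonbuild --backend=openvino:r21.06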
Also, this ticket compares the performance of the OV backend with the TF backend for the model. Just so that we can better track the issue, can you create a new issue which describes how to reproduce the perf regression between Triton and Intel's model server? I think that would be an interesting comparison. Please make sure you are using the same library versions.