openvino_backend has much lower performance than tensorflow_backend
Description
I have a recommendation model that loads normally with the TensorFlow backend, but when I load it with the OpenVINO (OV) backend, Triton reports that an operator is not supported. I therefore implemented the unsupported operator and packaged the custom op into a shared library; now the OV backend can also load the model. However, in performance testing the OV backend turned out to be much slower than TensorFlow. The results are as follows:

tensorflow_backend:
root@test:/data/xxxx# ./perf_client -a -b 600 -u localhost:8001 -i gRPC -m lat_tf --concurrency-range 1
*** Measurement Settings ***
Batch size: 600
Measurement window: 5000 msec
Using asynchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 157
Throughput: 18840 infer/sec
Avg latency: 31705 usec (standard deviation 2264 usec)
p50 latency: 31287 usec
p90 latency: 34560 usec
p95 latency: 36016 usec
p99 latency: 37818 usec
Avg gRPC time: 31804 usec ((un)marshal request/response 2205 usec + response wait 29599 usec)
Server:
Inference count: 113400
Execution count: 189
Successful request count: 189
Avg request latency: 27745 usec (overhead 218 usec + queue 180 usec + compute input 647 usec + compute infer 26637 usec + compute output 63 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 18840 infer/sec, latency 31705 usec
openvino_backend:
root@test:/data/xxxxx# ./perf_client -a -b 600 -u localhost:8001 -i gRPC -m lat_openvino --concurrency-range 1
*** Measurement Settings ***
Batch size: 600
Measurement window: 5000 msec
Using asynchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 35
Throughput: 4200 infer/sec
Avg latency: 142570 usec (standard deviation 11775 usec)
p50 latency: 141110 usec
p90 latency: 157490 usec
p95 latency: 158868 usec
p99 latency: 171043 usec
Avg gRPC time: 142093 usec ((un)marshal request/response 992 usec + response wait 141101 usec)
Server:
Inference count: 25200
Execution count: 42
Successful request count: 42
Avg request latency: 140174 usec (overhead 138 usec + queue 35 usec + compute input 683 usec + compute infer 138491 usec + compute output 827 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 4200 infer/sec, latency 142570 usec
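Since the server-side breakdown attributes almost all of the extra latency to compute infer (138491 usec vs 26637 usec), it may also be worth checking whether the gap changes under load, for example by sweeping the client concurrency with the same tool. This is a suggested follow-up measurement, not one from the runs above:

./perf_client -a -b 600 -u localhost:8001 -i gRPC -m lat_openvino --concurrency-range 1:4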
Triton Information
What version of Triton are you using? 21.06
Are you using the Triton container or did you build it yourself? I am using the Triton container, but I recompiled the OpenVINO backend myself from openvino_backend r21.06 against OpenVINO 2021.3. This is my command:
cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_BUILD_OPENVINO_VERSION=2021.3.394 \
      -DTRITON_BUILD_CONTAINER_VERSION=21.06 ..
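For completeness, a full out-of-source build following the openvino_backend README would look roughly like the sketch below; the checkout location and the r21.06 branch tag are assumed here, not taken from the report:

git clone -b r21.06 https://github.com/triton-inference-server/openvino_backend.git
cd openvino_backend && mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install \
      -DTRITON_BUILD_OPENVINO_VERSION=2021.3.394 \
      -DTRITON_BUILD_CONTAINER_VERSION=21.06 ..
make install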
To Reproduce
My code: https://drive.google.com/file/d/1WObxsMidRSEnkSr97P2bHUytX5UKoym_/view?usp=sharing. It includes the model repository (OV and TF versions) and the custom-op shared library. Load the repository with Triton, place the shared library somewhere on the server, and update the shared-library path in the OV model's configuration file accordingly.
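The OV model's config.pbtxt is where that shared-library path is set. A minimal sketch is shown below; the parameter key CPU_EXTENSION_PATH, the paths, and the omitted input/output sections are illustrative placeholders rather than the exact contents of my config:

name: "lat_openvino"
backend: "openvino"
max_batch_size: 600
# input/output tensor definitions omitted in this sketch
parameters: {
  key: "CPU_EXTENSION_PATH"  # hypothetical key; use the key your backend build actually reads
  value: { string_value: "/path/to/libcustom_op.so" }
}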
Expected behavior
I expected the OV backend to perform better than the TF backend. Who can help me? Thank you very much.
Top GitHub Comments
Yes, the model.xml and model.bin in my repo were created from the TF SavedModel.
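For reference, converting a TF SavedModel to OpenVINO IR with the 2021.x Model Optimizer looks roughly like the following; the directories are placeholders and the exact flags depend on the model:

python3 mo.py --saved_model_dir ./saved_model --output_dir ./model_repository/lat_openvino/1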
We don't officially support OpenVINO 2021.4, and the onnxruntime_backend is still at 2021.2. But you can use our build.py script to build OpenVINO backends for multiple versions: https://github.com/triton-inference-server/server/blob/main/build.py#L62-L83
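For example, a build of just the OpenVINO backend via build.py might look roughly like this; the build directory is a placeholder, additional flags may be required, and the OpenVINO version that gets built is controlled by the TRITON_VERSION_MAP in the lines linked above:

python3 build.py --build-dir=/tmp/tritonbuild --backend=openvino:r21.06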
Also, this ticket compares the performance of the OV backend with the TF backend for the model. Just so that we can better track the issue, can you create a new issue which describes how to reproduce the perf regression between Triton and Intel's model server? I think that would be an interesting comparison. Please make sure you are using the same library versions.