What can I do to make inference faster or more responsive?
I just followed the most basic tutorial for putting up an inference server with my TensorRT model plan. The model itself runs in ~55 ms when I run it directly with TensorRT, but a request through the Triton API takes 6.2 seconds.
Here is the model configuration file:
name: "default"
platform: "tensorrt_plan"
input {
  name: "input"
  data_type: TYPE_FP32
  dims: 1
  dims: 6
  dims: 320
  dims: 640
}
output {
  name: "output"
  data_type: TYPE_FP32
  dims: 1
  dims: 2
  dims: 320
  dims: 640
}
Here is the request test script:
import requests
import time
import numpy as np
if __name__ == "__main__":
    input_data = np.random.rand(1, 6, 320, 640)
    request_data = {
        "inputs": [{
            "name": "input",
            "shape": [1, 6, 320, 640],
            "datatype": "FP32",
            "data": input_data.tolist()
        }],
        "outputs": [{"name": "output"}]
    }
    for i in range(100):
        start = time.time()
        res = requests.post(
            url="http://localhost:8000/v2/models/stereo/versions/1/infer",
            json=request_data,
        ).json()
        print("time: ", time.time() - start)
The model plan itself is FP16, but when I set the request data type to FP16, it gives me an error.
6 seconds compared to 55 ms is a huge difference, and I don't know where it comes from: the server is on the same computer, so there should not be much communication overhead. Someone suggested using shared memory, but I will eventually move the server to another computer, so there is no reason to test that.
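(As a side note, a rough way to see where some of that time might go is to measure only the client-side JSON encoding of the payload: a 1x6x320x640 FP32 tensor is about 1.2 million numbers when serialized as a JSON list. This is just a sketch for illustration, not part of the original test.)

import json
import time
import numpy as np

# Same tensor shape as the request above: roughly 1.2 million float values.
input_data = np.random.rand(1, 6, 320, 640)

start = time.time()
payload = json.dumps({"data": input_data.tolist()})
print("JSON encode time (s):", time.time() - start)
print("payload size (MB):", len(payload) / 1e6)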
Running Model Analyzer gives me a report which shows that the p99 latency is only 125 ms.
Please share your thoughts, or point me to the right place for examples.
Thanks!
Top GitHub Comments
It should be faster to send the data over as a binary blob with the client libraries instead of as JSON. Is there a reason you are not using the Triton client?
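For reference, a minimal sketch of what that could look like with the tritonclient Python package (pip install tritonclient[http]), assuming the default HTTP port and the model name "stereo" taken from the URL in the test script above; the binary_data flags are what avoid serializing the tensors as JSON lists:

import numpy as np
import tritonclient.http as httpclient

# Connect to the local Triton HTTP endpoint (same port as in the script above).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor; FP32 to match the model configuration.
input_data = np.random.rand(1, 6, 320, 640).astype(np.float32)

infer_input = httpclient.InferInput("input", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data, binary_data=True)  # sent as raw bytes

infer_output = httpclient.InferRequestedOutput("output", binary_data=True)

# Model name "stereo" is an assumption based on the request URL in the question.
result = client.infer(model_name="stereo", inputs=[infer_input], outputs=[infer_output])
output = result.as_numpy("output")
print(output.shape, output.dtype)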
Also, I did the test with the newest Triton Docker image provided by NVIDIA.