Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

What are the things to do to make inference faster or more responsive?

See original GitHub issue

I just followed the most basic tutorial for putting up an inference server with my TensorRT model plan. The model itself runs in ~55 ms when I use it directly with TensorRT, but it takes 6.2 seconds when I request it through the Triton API.

Here is the model configuration file:

name: "default"
platform: "tensorrt_plan"
input {
  name: "input"
  data_type: TYPE_FP32
  dims: 1
  dims: 6
  dims: 320
  dims: 640
}
output {
  name: "output"
  data_type: TYPE_FP32
  dims: 1
  dims: 2
  dims: 320
  dims: 640
}
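
As a sanity check, the configuration Triton actually loaded can be queried over its HTTP model-configuration endpoint; a minimal sketch with requests, assuming the same server address and the model name used in the test script below:

import requests

# Ask Triton for the configuration it loaded for this model
# (model name taken from the request URL in the test script below).
config = requests.get("http://localhost:8000/v2/models/stereo/config").json()
print(config)

# Standard KServe v2 readiness check exposed by Triton
print(requests.get("http://localhost:8000/v2/health/ready").status_code)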

Here is the request test script:

import requests
import time
import numpy as np

if __name__ == "__main__":
    input_data = np.random.rand(1, 6, 320, 640)
    request_data = {
        "inputs": [{
            "name": "input",
            "shape": [1, 6, 320, 640],
            "datatype": "FP32",
            "data": input_data.tolist()
        }],
        "outputs": [{"name": "output"}]
    }
    for i in range(100):
        start = time.time()
        res = requests.post(
            url="http://localhost:8000/v2/models/stereo/versions/1/infer",
            json=request_data,
        ).json()
        print("time: ", time.time() - start)

The model plan itself is FP16, but when I set the request data type to FP16, it gives me an error.

6 seconds compared to 55 ms is a huge difference, and I don’t know where it comes from; the server is on the same computer, so there should not be much communication overhead. Someone suggested using shared memory, but I will eventually move the server to another computer, so there is no point in testing that.
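
Much of that gap is likely spent on client-side JSON encoding and server-side decoding of the roughly 1.2 million input values rather than on inference itself; a minimal sketch to time just the encoding step, assuming the same input shape as above:

import json
import time

import numpy as np

# Same shape as the model input: 1 x 6 x 320 x 640, about 1.2M float values
input_data = np.random.rand(1, 6, 320, 640)

start = time.time()
payload = json.dumps(input_data.tolist())  # roughly what requests does for json=
print("JSON encode time: %.3f s, payload size: %.1f MB"
      % (time.time() - start, len(payload) / 1e6))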

Running Model Analyzer gives me the following report:

detailed_report.pdf

which shows that the p99 latency is only 125 ms.

Please share your thoughts, or point me to the right place for examples.

Thanks!

Issue Analytics

  • State: open
  • Created 10 months ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

2 reactions
jbkyang-nvi commented, Dec 10, 2022

It should be faster to send the data over as a binary blob with the client libraries instead of as JSON. Is there a reason you are not using the Triton client?
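
For reference, a minimal sketch of what that could look like with the tritonclient Python package, assuming the same model name and input/output names as in the script above:

import numpy as np
import tritonclient.http as httpclient

# Connect to the local Triton HTTP endpoint
client = httpclient.InferenceServerClient(url="localhost:8000")

# binary_data=True sends the tensor as a raw binary blob instead of a JSON
# list, which avoids the expensive (de)serialization of ~1.2M float values.
input_data = np.random.rand(1, 6, 320, 640).astype(np.float32)
infer_input = httpclient.InferInput("input", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data, binary_data=True)

infer_output = httpclient.InferRequestedOutput("output", binary_data=True)

result = client.infer(
    model_name="stereo",
    inputs=[infer_input],
    outputs=[infer_output],
)
print(result.as_numpy("output").shape)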

1 reaction
deephog commented, Nov 29, 2022

I also ran the test with the newest Triton Docker image provided by NVIDIA.

Read more comments on GitHub >

Top Results From Across the Web

Follow 5 Steps to Make an Inference - Smekens Education
Inferences are made by putting multiple clues together. Group the following details: jumping up and down, moving around, and rubbing and huffing ...
8 Activities to Build Inference Skills
8 Activities to Build Inference Skills ... When you ask students to describe a character's traits, determine the theme of a story, examine...
4 Quick Tips When Teaching Making Inferences
Tip #1: Use pictures to teach students how to make inferences. I love using pictures to teach a lot of comprehension skills, but...
Inference | Classroom Strategies - Reading Rockets
Helping students understand when information is implied, or not directly stated, will improve their skill in drawing conclusions and making inferences.
Neural Network Inference Optimization/Acceleration
The following is an attempt to capture the main essences of inference optimization. Being able to do inference as quickly as possible is...
