What can I do to make inference faster or more responsive?
I just followed the most basic tutorial for putting up an inference server with my TensorRT model plan. The model itself runs in ~55 ms when I run it directly with TensorRT, but a request through the Triton API takes 6.2 seconds.
Here is the model configuration file:
name: "default"
platform: "tensorrt_plan"
input {
  name: "input"
  data_type: TYPE_FP32
  dims: 1
  dims: 6
  dims: 320
  dims: 640
}
output {
  name: "output"
  data_type: TYPE_FP32
  dims: 1
  dims: 2
  dims: 320
  dims: 640
}
Here is the request test script:
import requests
import time
import numpy as np
if __name__ == "__main__":
    input_data = np.random.rand(1, 6, 320, 640)
    request_data = {
        "inputs": [{
            "name": "input",
            "shape": [1, 6, 320, 640],
            "datatype": "FP32",
            "data": input_data.tolist()
        }],
        "outputs": [{"name": "output"}]
    }
    for i in range(100):
        start = time.time()
        res = requests.post(
            url="http://localhost:8000/v2/models/stereo/versions/1/infer",
            json=request_data,
        ).json()
        print("time: ", time.time() - start)
The model plan itself is FP16, but when I set the request data type to FP16, it gives me an error.
6 seconds compared to 55 ms is a huge difference, and I don't know where it comes from: the server is on the same computer, so there should not be much communication overhead. Someone suggested using shared memory, but I will eventually move the server to another computer, so there is no reason to test that.
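(As a side note, a rough way to see where some of that time might go is to measure only the client-side JSON encoding of the payload: a 1x6x320x640 FP32 tensor is about 1.2 million numbers when serialized as a JSON list. This is just a sketch for illustration, not part of the original test.)

import json
import time
import numpy as np

# Same tensor shape as the request above: roughly 1.2 million float values.
input_data = np.random.rand(1, 6, 320, 640)

start = time.time()
payload = json.dumps({"data": input_data.tolist()})
print("JSON encode time (s):", time.time() - start)
print("payload size (MB):", len(payload) / 1e6)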
Running Model Analyzer gives me a report which shows that the p99 latency is only 125 ms.
Please share your thoughts, or point me to the right place for examples.
Thanks!
Top GitHub Comments
It should be faster to send the data over as a binary blob with the client libraries instead of as JSON. Is there a reason you are not using the Triton client?
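For reference, a minimal sketch of what that could look like with the tritonclient Python package (pip install tritonclient[http]), assuming the default HTTP port and the model name "stereo" taken from the URL in the test script above; the binary_data flags are what avoid serializing the tensors as JSON lists:

import numpy as np
import tritonclient.http as httpclient

# Connect to the local Triton HTTP endpoint (same port as in the script above).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor; FP32 to match the model configuration.
input_data = np.random.rand(1, 6, 320, 640).astype(np.float32)

infer_input = httpclient.InferInput("input", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data, binary_data=True)  # sent as raw bytes

infer_output = httpclient.InferRequestedOutput("output", binary_data=True)

# Model name "stereo" is an assumption based on the request URL in the question.
result = client.infer(model_name="stereo", inputs=[infer_input], outputs=[infer_output])
output = result.as_numpy("output")
print(output.shape, output.dtype)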
Also, I did the test with the newest Triton Docker image provided by NVIDIA.