Avg HTTP time is about 5000 usec higher than Avg request latency
Description
- I deployed the ONNX model of YOLOv5 in Triton and optimized it with the TensorRT accelerator. I have also tested the TensorRT model of YOLOv5 outside Triton, and its inference time is close to the Avg request latency, but the extra time reported as Avg HTTP time is a big problem. I want to know how I can reduce this part of the time.
Triton Information
What version of Triton are you using?
- 21.06.1
Are you using the Triton container or did you build it yourself?
- using the Triton container
To Reproduce
Steps to reproduce the behavior.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
platform: "onnxruntime_onnx"
max_batch_size: 8
default_model_filename: "model.onnx"
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1, 8 ]
  }
]
instance_group [
  {
    count: 1
    gpus: [ 0 ]
  }
]
dynamic_batching { }
model_warmup [
  {
    name: "warmup"
    batch_size: 8
    inputs: {
      key: "images"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 512, 480 ]
        random_data: true
      }
    }
  }
]
optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
  }]
}}
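For reference, here is a minimal Python HTTP client sketch that exercises this configuration end to end (a sketch only: it assumes the model is served under the name det_onnx used with perf_analyzer below, a batch of 1, the 3x512x480 warmup shape, and that the tritonclient[http] and numpy packages are installed):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One image in a batch of 1, matching the 3x512x480 shape used for warmup.
image = np.random.rand(1, 3, 512, 480).astype(np.float32)

inputs = [httpclient.InferInput("images", [1, 3, 512, 480], "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [httpclient.InferRequestedOutput("output")]

result = client.infer(model_name="det_onnx", inputs=inputs, outputs=outputs)
print(result.as_numpy("output").shape)

Timing this call on the client and comparing it against the server-side latency reported by Triton shows the same kind of gap that perf_analyzer reports below.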
perf_analyzer -m det_onnx --shape images:3,512,480 --concurrency-range 1 --percentile=95
*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using p95 latency

Request concurrency: 1
  Client:
    Request count: 217
    Throughput: 43.4 infer/sec
    p50 latency: 23259 usec
    p90 latency: 24276 usec
    p95 latency: 24538 usec
    p99 latency: 24760 usec
    Avg HTTP time: 23013 usec (send/recv 1968 usec + response wait 21045 usec)
  Server:
    Inference count: 261
    Execution count: 261
    Successful request count: 261
    Avg request latency: 16524 usec (overhead 51 usec + queue 91 usec + compute input 503 usec + compute infer 15733 usec + compute output 146 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 43.4 infer/sec, latency 24538 usec
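For reference, the gap in this run is Avg HTTP time - Avg request latency = 23013 - 16524 = 6489 usec (about 6.5 ms). Of that, 1968 usec is the client-side send/recv, which leaves roughly 4.5 ms of network transfer and HTTP (de)serialization time that the server-side breakdown never sees.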
Expected behavior
- I expect the request latency of the model deployed with Triton to be close to the standalone inference time of my model.
Issue Analytics
- State:
- Created 2 years ago
- Comments: 11 (5 by maintainers)
Top GitHub Comments
The send/recv is client side timing. There is no way for the client to know when the data gets completely delivered to the server…
<client sends data(T1)><data on network(T2)><server receives data(T3)><server processes data(T4)><server sends data(T5)><data on network(T6)><client receives data(T7)>
send/recv is T1 + T7. The response wait time is T2 + T3 + T4 + T5 + T6. Now, there can be some overlap between T1, T2 and T3, and the same goes for T5, T6 and T7, but they cannot entirely overlap. The larger the data, the smaller the overlap.
XXXX is T2 + T3 + T5 + T6. In the case of shared memory, the data is not transferred over the network but passed within a shared memory region, making the request and response JSON much smaller and hence X smaller as a result. You must use --output-shared-memory-size, described here: https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/main.cc#L532
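As a sketch of how the same perf_analyzer run might look with system shared memory enabled (the 102400-byte output size is only an illustrative upper bound for the [-1, 8] FP32 output and has to be sized for your model):

perf_analyzer -m det_onnx --shape images:3,512,480 --concurrency-range 1 --percentile=95 --shared-memory=system --output-shared-memory-size=102400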
If the client and the server are running on the same machine, you can try using the shared memory and CUDA memory pools. HTTP/GRPC compression might also help reduce the networking time. (cc @tanmayv25 / @GuanLuo to correct me if I’m wrong here).
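As a rough sketch of the compression option on the HTTP path (newer tritonclient.http versions expose request_compression_algorithm / response_compression_algorithm on infer(); check your client version before relying on them):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
image = np.random.rand(1, 3, 512, 480).astype(np.float32)

inputs = [httpclient.InferInput("images", [1, 3, 512, 480], "FP32")]
inputs[0].set_data_from_numpy(image)

result = client.infer(
    model_name="det_onnx",
    inputs=inputs,
    request_compression_algorithm="gzip",   # compress the request body on the way out
    response_compression_algorithm="gzip",  # ask the server to compress the response
)

Compression trades CPU time for bytes on the wire, so it mainly helps when the network, rather than serialization, dominates the send/recv portion.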
You can find example clients for CUDA memory and system shared memory in the links below:
https://github.com/triton-inference-server/client/blob/r21.07/src/python/examples/simple_http_cudashm_client.py
https://github.com/triton-inference-server/client/blob/r21.07/src/python/examples/simple_http_shm_client.py
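Condensed from the simple_http_shm_client.py example above, a sketch of the system-shared-memory flow for this particular model (region names, shm keys and the output byte size are illustrative; the output region must be large enough for the biggest possible [-1, 8] result):

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm
from tritonclient import utils

client = httpclient.InferenceServerClient(url="localhost:8000")
client.unregister_system_shared_memory()  # drop any stale registrations

image = np.random.rand(1, 3, 512, 480).astype(np.float32)
input_byte_size = image.size * image.itemsize
output_byte_size = 1024 * 8 * 4  # illustrative upper bound: 1024 rows of 8 FP32 values

# Create the regions and copy the input tensor into its region.
ip_handle = shm.create_shared_memory_region("images_data", "/images_shm", input_byte_size)
shm.set_shared_memory_region(ip_handle, [image])
op_handle = shm.create_shared_memory_region("output_data", "/output_shm", output_byte_size)

# Register both regions with the server.
client.register_system_shared_memory("images_data", "/images_shm", input_byte_size)
client.register_system_shared_memory("output_data", "/output_shm", output_byte_size)

# Point the request at the regions instead of carrying tensors in the HTTP body.
inputs = [httpclient.InferInput("images", [1, 3, 512, 480], "FP32")]
inputs[0].set_shared_memory("images_data", input_byte_size)
outputs = [httpclient.InferRequestedOutput("output")]
outputs[0].set_shared_memory("output_data", output_byte_size)

result = client.infer(model_name="det_onnx", inputs=inputs, outputs=outputs)
out = result.get_output("output")
detections = shm.get_contents_as_numpy(
    op_handle, utils.triton_to_np_dtype(out["datatype"]), out["shape"])

# Clean up the registrations and the regions.
client.unregister_system_shared_memory()
shm.destroy_shared_memory_region(ip_handle)
shm.destroy_shared_memory_region(op_handle)

This only works when client and server share a machine; the request and response bodies then carry only region names and offsets, which is what shrinks the send/recv and response-wait time.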