Multi-instance performance is slightly lower than expected [pytorch_backend]
Hi.
I found that multi-instance performance is slightly lower than I expected. My environment and experiment results are below.
Machine Information
GPU: TITAN RTX × 3
OS: Ubuntu 18.04
Triton with pytorch_backend: r22.08
Model config
name: "resnet101"
platform: "pytorch_libtorch"
max_batch_size: 32
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]  # changed to [ 0, 1 ] for the multi-instance runs
}
]
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [3, 224, 224]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [1000]
label_filename: "resnet_labels.txt"
}
]
version_policy: { specific: {versions: [1]}}
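For clarity, the two-GPU runs only change the gpus field; a sketch of that instance_group is below (the rest of the config file is unchanged). Since count is applied per GPU listed under gpus for KIND_GPU, count: 1 with two GPUs gives two model instances in total.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]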
perf_analyzer results (single instance, gpus: [ 0 ])
*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 30000 msec
Latency limit: 0 msec
Request Rate limit: 200 requests per seconds
Using uniform distribution on request generation
Using asynchronous calls for inference
Stabilizing using average latency
Request Rate: 200 inference requests per seconds
Client:
Request count: 3245
Delayed Request Count: 537
Throughput: 90.1381 infer/sec
Avg latency: 9909994 usec (standard deviation 10169 usec)
p50 latency: 9906561 usec
p90 latency: 17777911 usec
p95 latency: 18744365 usec
p99 latency: 19588592 usec
Avg HTTP time: 9908353 usec (send/recv 2162 usec + response wait 9906191 usec)
Server:
Inference count: 3245
Execution count: 3245
Successful request count: 3245
Avg request latency: 9893671 usec (overhead 27 usec + queue 9882592 usec + compute input 191 usec + compute infer 10823 usec + compute output 37 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 200, throughput: 90.1381 infer/sec, latency 9909994 usec
perf_analyzer results (multi-instance, gpus: [ 0, 1 ])
*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 30000 msec
Latency limit: 0 msec
Request Rate limit: 200 requests per seconds
Using uniform distribution on request generation
Using asynchronous calls for inference
Stabilizing using average latency
Request Rate: 200 inference requests per seconds
Client:
Request count: 6230
Delayed Request Count: 5593
Throughput: 173.054 infer/sec
Avg latency: 1381404 usec (standard deviation 19187 usec)
p50 latency: 1344601 usec
p90 latency: 2542631 usec
p95 latency: 2859819 usec
p99 latency: 3037915 usec
Avg HTTP time: 1381659 usec (send/recv 536 usec + response wait 1381123 usec)
Server:
Inference count: 6231
Execution count: 6231
Successful request count: 6231
Avg request latency: 1359799 usec (overhead 26 usec + queue 1348261 usec + compute input 172 usec + compute infer 11301 usec + compute output 39 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 200, throughput: 173.054 infer/sec, latency 1381404 usec
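For reference, both runs were driven with a perf_analyzer command along these lines (a reconstruction from the measurement settings printed above; the exact flags may have differed):
perf_analyzer -m resnet101 -b 1 --async \
    --request-rate-range=200 \
    --measurement-interval=30000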
The two-instance throughput is close to double the single-instance number, but I expected it to scale linearly: 90.1 × 2 ≈ 180 infer/sec, whereas I measure 173. With three instances, perf_analyzer reports 253.442 infer/sec versus an expected 90.1 × 3 ≈ 270. In other words, each instance added on another GPU contributes about 8~10 infer/sec less than a single instance delivers on its own.
Why does this happen?
Is it contention on the request queue between the two instances, or some other contention that I am not aware of?
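In case it helps narrow this down, per-GPU utilization during the runs can be watched with nvidia-smi, for example:
nvidia-smi dmon -s u   # prints per-GPU SM/memory utilization, sampled once per second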
Thanks for reading.
Top GitHub Comments
Please also note that if you are running perf analyzer on the same machine as Triton, then they are competing for CPU resources. That can impact your results, even for models running on the GPU.
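For example, running the client side from the Triton SDK container on a different host avoids that contention (a sketch; the hostname is a placeholder, and the other flags should match your earlier runs):
docker run --rm --net=host nvcr.io/nvidia/tritonserver:22.08-py3-sdk \
    perf_analyzer -m resnet101 -b 1 --async \
        --request-rate-range=200 -u <triton-host>:8000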
As for “how many instances should I use?”, that is a question best answered by Model Analyzer. It is a wrapper around Perf Analyzer that will run multiple experiments to find the best configuration under any constraints you give it (such as a maximum latency budget).
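A minimal Model Analyzer invocation looks something like this (paths are placeholders; constraints such as a latency budget are supplied through its configuration options):
model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models resnet101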
Understood. They don’t share GPU resources but they do share the network for receiving and responding to requests. I suspect @dyastremsky was on the right track with this statement: