Compute infer time increases linearly with batch size even with batching
Description
Request latency and compute infer time of the out-of-the-box inception_graphdef model running on Triton server increase linearly with batch size. With batching, the compute infer time is expected to stay close to constant across batch sizes rather than growing linearly. Enabling or disabling dynamic batching does not change the behavior.
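For reference, dynamic batching is toggled through the dynamic_batching block in the model's config.pbtxt. A minimal sketch of what was enabled and disabled during testing (the preferred batch sizes and queue delay below are illustrative placeholders, not values from the original setup):
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}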
Triton Information
What version of Triton are you using? nvcr.io/nvidia/tritonserver:21.09-py3
Are you using the Triton container or did you build it yourself? I used the Triton container without any modification.
To Reproduce
Steps to reproduce the behavior:
- Run a Triton server on an AWS g4dn.2xlarge instance (T4 GPU) using the following docker-compose.yaml (a note on loading the model follows the file):
version: '3.7'
services:
  inferenceserver:
    image: nvcr.io/nvidia/tritonserver:21.09-py3
    command: tritonserver --model-repository=/models --model-control-mode=explicit --exit-on-error=false
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    volumes:
      - /model_repository:/models
    ports:
      - 8001:8001
      - 8002:8002
      - 8000:8000
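Because the server is started with --model-control-mode=explicit, the model is not loaded automatically; it has to be loaded through the model repository API. A minimal sketch using the HTTP endpoint (assuming the server is reachable on localhost:8000):
curl -X POST localhost:8000/v2/repository/models/inception_graphdef/load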
- Load the inception_graphdef model that comes out of the box with this repo. This is the model config:
{
  "name": "inception_graphdef",
  "platform": "tensorflow_graphdef",
  "backend": "tensorflow",
  "version_policy": {
    "latest": {
      "num_versions": 1
    }
  },
  "max_batch_size": 128,
  "input": [
    {
      "name": "input",
      "data_type": "TYPE_FP32",
      "format": "FORMAT_NHWC",
      "dims": [
        299,
        299,
        3
      ],
      "is_shape_tensor": false,
      "allow_ragged_batch": false
    }
  ],
  "output": [
    {
      "name": "InceptionV3/Predictions/Softmax",
      "data_type": "TYPE_FP32",
      "dims": [
        1001
      ],
      "label_filename": "inception_labels.txt",
      "is_shape_tensor": false
    }
  ],
  "batch_input": [],
  "batch_output": [],
  "optimization": {
    "priority": "PRIORITY_DEFAULT",
    "input_pinned_memory": {
      "enable": true
    },
    "output_pinned_memory": {
      "enable": true
    },
    "gather_kernel_buffer_threshold": 0,
    "eager_batching": false
  },
  "instance_group": [
    {
      "name": "inception_graphdef",
      "kind": "KIND_GPU",
      "count": 1,
      "gpus": [
        0
      ],
      "secondary_devices": [],
      "profile": [],
      "passive": false,
      "host_policy": ""
    }
  ],
  "default_model_filename": "model.graphdef",
  "cc_model_filenames": {},
  "metric_tags": {},
  "parameters": {},
  "model_warmup": []
}
- Use perf_analyzer to benchmark the model. Below are the results for batch sizes from 1 to 128; the compute infer time increases roughly linearly (a note on measuring with request concurrency follows the results).
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 1 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
Batch size: 1
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 10
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 76
Throughput: 75.9241 infer/sec
Avg latency: 13137 usec (standard deviation 871 usec)
p50 latency: 12867 usec
p90 latency: 13907 usec
p95 latency: 14462 usec
p99 latency: 16183 usec
Avg gRPC time: 13116 usec ((un)marshal request/response 326 usec + response wait 12790 usec)
Server:
Inference count: 76
Execution count: 76
Successful request count: 76
Avg request latency: 10140 usec (overhead 94 usec + queue 24 usec + compute input 183 usec + compute infer 9827 usec + compute output 12 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 75.9241 infer/sec, latency 13137 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 2 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
Batch size: 2
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 10
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 52
Throughput: 103.896 infer/sec
Avg latency: 19165 usec (standard deviation 1285 usec)
p50 latency: 18865 usec
p90 latency: 20444 usec
p95 latency: 21123 usec
p99 latency: 22487 usec
Avg gRPC time: 19144 usec ((un)marshal request/response 470 usec + response wait 18674 usec)
Server:
Inference count: 104
Execution count: 52
Successful request count: 52
Avg request latency: 12955 usec (overhead 93 usec + queue 26 usec + compute input 303 usec + compute infer 12522 usec + compute output 10 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 103.896 infer/sec, latency 19165 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 4 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
Batch size: 4
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 10
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 31
Throughput: 124 infer/sec
Avg latency: 32096 usec (standard deviation 1606 usec)
p50 latency: 31647 usec
p90 latency: 34947 usec
p95 latency: 35169 usec
p99 latency: 35615 usec
Avg gRPC time: 32072 usec ((un)marshal request/response 941 usec + response wait 31131 usec)
Server:
Inference count: 124
Execution count: 31
Successful request count: 31
Avg request latency: 19304 usec (overhead 90 usec + queue 24 usec + compute input 573 usec + compute infer 18606 usec + compute output 10 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 124 infer/sec, latency 32096 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 8 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
Batch size: 8
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 10
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 17
Throughput: 136 infer/sec
Avg latency: 57567 usec (standard deviation 1818 usec)
p50 latency: 57267 usec
p90 latency: 60173 usec
p95 latency: 60729 usec
p99 latency: 61739 usec
Avg gRPC time: 57660 usec ((un)marshal request/response 1498 usec + response wait 56162 usec)
Server:
Inference count: 144
Execution count: 18
Successful request count: 18
Avg request latency: 32837 usec (overhead 137 usec + queue 38 usec + compute input 1584 usec + compute infer 31054 usec + compute output 22 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 136 infer/sec, latency 57567 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 16 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
Batch size: 16
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 10
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 18
Throughput: 144 infer/sec
Avg latency: 109309 usec (standard deviation 3746 usec)
p50 latency: 107947 usec
p90 latency: 114506 usec
p95 latency: 116352 usec
p99 latency: 117348 usec
Avg gRPC time: 109288 usec ((un)marshal request/response 3022 usec + response wait 106266 usec)
Server:
Inference count: 288
Execution count: 18
Successful request count: 18
Avg request latency: 59491 usec (overhead 164 usec + queue 34 usec + compute input 3309 usec + compute infer 55956 usec + compute output 27 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 144 infer/sec, latency 109309 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 32 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
Batch size: 32
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 10
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 13
Throughput: 138.62 infer/sec
Avg latency: 222231 usec (standard deviation 24533 usec)
p50 latency: 210092 usec
p90 latency: 267586 usec
p95 latency: 267586 usec
p99 latency: 287637 usec
Avg gRPC time: 222213 usec ((un)marshal request/response 7342 usec + response wait 214871 usec)
Server:
Inference count: 416
Execution count: 13
Successful request count: 13
Avg request latency: 115993 usec (overhead 239 usec + queue 37 usec + compute input 6783 usec + compute infer 108889 usec + compute output 43 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 138.62 infer/sec, latency 222231 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 64 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
Batch size: 64
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 10
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 11
Throughput: 140.772 infer/sec
Avg latency: 433568 usec (standard deviation 4913 usec)
p50 latency: 433047 usec
p90 latency: 440276 usec
p95 latency: 442303 usec
p99 latency: 442303 usec
Avg gRPC time: 433547 usec ((un)marshal request/response 13015 usec + response wait 420532 usec)
Server:
Inference count: 704
Execution count: 11
Successful request count: 11
Avg request latency: 265554 usec (overhead 313 usec + queue 43 usec + compute input 48010 usec + compute infer 217111 usec + compute output 77 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 140.772 infer/sec, latency 433568 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 128 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
Batch size: 128
Using "count_windows" mode for stabilization
Minimum number of samples in each window: 10
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 9
Throughput: 127.986 infer/sec
Avg latency: 915560 usec (standard deviation 25010 usec)
p50 latency: 912273 usec
p90 latency: 930086 usec
p95 latency: 969772 usec
p99 latency: 969772 usec
Avg gRPC time: 911290 usec ((un)marshal request/response 27866 usec + response wait 883424 usec)
Server:
Inference count: 1280
Execution count: 10
Successful request count: 10
Avg request latency: 521936 usec (overhead 364 usec + queue 47 usec + compute input 95572 usec + compute infer 425814 usec + compute output 139 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 127.986 infer/sec, latency 915560 usec
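Note that with a single synchronous client, each request is executed on its own (the execution count roughly equals the request count in every run above), so dynamic batching never gets a chance to merge requests into larger batches. To exercise server-side batching, the request concurrency has to be swept instead of (or in addition to) the client-side batch size. A sketch of such a run, with the concurrency range chosen arbitrarily:
perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 1 --concurrency-range 1:16:4 -u 10.3.12.37:8001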
Expected behavior
I expect the compute infer time in the perf_analyzer results to stay roughly constant as long as the batch size is smaller than the max batch size of 128.
Top GitHub Comments
It turned out batching doesn’t work the way I thought. Batching doesn’t mean the GPU processes all images at the same time: the GPU has fewer than 10k cores, while a single layer can easily require on the order of 100M multiplications per image, so the GPU still has to break each layer into smaller chunks of work. It runs all multiplications in one chunk in parallel, then moves on to the next. Adding images to a batch creates more chunks, so the time grows almost linearly. Triton works fine; there’s no problem with it. Thanks @tanmayv25 for helping me out.
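As a rough sanity check (all numbers are approximate and not from the original thread): Inception v3 is commonly cited at roughly 6 GFLOPs per 299×299 image, while a T4 has about 2,560 CUDA cores and a peak of around 8 TFLOPS FP32. A batch of 32 therefore represents on the order of 200 GFLOPs, far more work than the cores can hold in flight at once, so the GPU is effectively saturated even at small batch sizes and compute time grows with the total number of images rather than staying flat.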
Thank you @tanmayv25. I tried
--shared-memory=cuda
but the result was still the same. Below is the result. I think the actual infer time (the matrix multiplications, excluding data movement) should grow sub-linearly, and that time should be significant enough to bend the graph toward sub-linear rather than linear. Do you know what else may be happening here?
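For reference, the run mentioned above was presumably invoked along these lines; a sketch of the command with CUDA shared memory enabled (all flags other than --shared-memory are copied from the earlier runs):
perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 8 --shared-memory=cuda --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001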