
Compute infer time increases linearly with batch size even with batching


Description
Request latency and compute infer time for the out-of-the-box inception_graphdef model running on a Triton server increase linearly with batch size. With batching, the compute infer time is supposed to stay close to constant across batch sizes instead of increasing linearly. Enabling or disabling dynamic batching does not change the behavior.
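For reference, dynamic batching is enabled by adding a dynamic_batching block to the model's config.pbtxt. A minimal sketch, with illustrative values for the preferred batch sizes and queue delay (not taken from the report above):

dynamic_batching {
  preferred_batch_size: [ 32, 64 ]
  max_queue_delay_microseconds: 100
}

Note that the dynamic batcher only merges concurrent requests into a single execution; it does not change how long one already-batched request takes to compute, which is consistent with toggling it having no effect in this single-client, concurrency-1 test.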

Triton Information
What version of Triton are you using? nvcr.io/nvidia/tritonserver:21.09-py3

Are you using the Triton container or did you build it yourself? I used the Triton container without any modification.

To Reproduce
Steps to reproduce the behavior:

  • Run a Triton server on an AWS g4dn.2xlarge instance (one T4 GPU), using the following docker-compose.yaml:
version: '3.7'
services:
  inferenceserver:
    image: nvcr.io/nvidia/tritonserver:21.09-py3
    command: tritonserver --model-repository=/models --model-control-mode=explicit --exit-on-error=false
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
    volumes:
      - /model_repository:/models
    ports:
      - 8001:8001
      - 8002:8002
      - 8000:8000
  • Load the inception_graphdef model that ships with this repository (because of --model-control-mode=explicit above, the load has to be requested explicitly; a load call is sketched after the benchmark output below). This is the model config:
{
    "name": "inception_graphdef",
    "platform": "tensorflow_graphdef",
    "backend": "tensorflow",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 128,
    "input": [
        {
            "name": "input",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NHWC",
            "dims": [
                299,
                299,
                3
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false
        }
    ],
    "output": [
        {
            "name": "InceptionV3/Predictions/Softmax",
            "data_type": "TYPE_FP32",
            "dims": [
                1001
            ],
            "label_filename": "inception_labels.txt",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "inception_graphdef",
            "kind": "KIND_GPU",
            "count": 1,
            "gpus": [
                0
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.graphdef",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {},
    "model_warmup": []
}
  • Use perf_analyzer to benchmark the model. Below are the results for batch sizes 1 through 128; the compute infer time increases roughly linearly:
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 1 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
  Batch size: 1
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 76
    Throughput: 75.9241 infer/sec
    Avg latency: 13137 usec (standard deviation 871 usec)
    p50 latency: 12867 usec
    p90 latency: 13907 usec
    p95 latency: 14462 usec
    p99 latency: 16183 usec
    Avg gRPC time: 13116 usec ((un)marshal request/response 326 usec + response wait 12790 usec)
  Server: 
    Inference count: 76
    Execution count: 76
    Successful request count: 76
    Avg request latency: 10140 usec (overhead 94 usec + queue 24 usec + compute input 183 usec + compute infer 9827 usec + compute output 12 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 75.9241 infer/sec, latency 13137 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 2 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
  Batch size: 2
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 52
    Throughput: 103.896 infer/sec
    Avg latency: 19165 usec (standard deviation 1285 usec)
    p50 latency: 18865 usec
    p90 latency: 20444 usec
    p95 latency: 21123 usec
    p99 latency: 22487 usec
    Avg gRPC time: 19144 usec ((un)marshal request/response 470 usec + response wait 18674 usec)
  Server: 
    Inference count: 104
    Execution count: 52
    Successful request count: 52
    Avg request latency: 12955 usec (overhead 93 usec + queue 26 usec + compute input 303 usec + compute infer 12522 usec + compute output 10 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 103.896 infer/sec, latency 19165 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 4 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
  Batch size: 4
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 31
    Throughput: 124 infer/sec
    Avg latency: 32096 usec (standard deviation 1606 usec)
    p50 latency: 31647 usec
    p90 latency: 34947 usec
    p95 latency: 35169 usec
    p99 latency: 35615 usec
    Avg gRPC time: 32072 usec ((un)marshal request/response 941 usec + response wait 31131 usec)
  Server: 
    Inference count: 124
    Execution count: 31
    Successful request count: 31
    Avg request latency: 19304 usec (overhead 90 usec + queue 24 usec + compute input 573 usec + compute infer 18606 usec + compute output 10 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 124 infer/sec, latency 32096 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 8 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
  Batch size: 8
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 17
    Throughput: 136 infer/sec
    Avg latency: 57567 usec (standard deviation 1818 usec)
    p50 latency: 57267 usec
    p90 latency: 60173 usec
    p95 latency: 60729 usec
    p99 latency: 61739 usec
    Avg gRPC time: 57660 usec ((un)marshal request/response 1498 usec + response wait 56162 usec)
  Server: 
    Inference count: 144
    Execution count: 18
    Successful request count: 18
    Avg request latency: 32837 usec (overhead 137 usec + queue 38 usec + compute input 1584 usec + compute infer 31054 usec + compute output 22 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 136 infer/sec, latency 57567 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 16 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
  Batch size: 16
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 18
    Throughput: 144 infer/sec
    Avg latency: 109309 usec (standard deviation 3746 usec)
    p50 latency: 107947 usec
    p90 latency: 114506 usec
    p95 latency: 116352 usec
    p99 latency: 117348 usec
    Avg gRPC time: 109288 usec ((un)marshal request/response 3022 usec + response wait 106266 usec)
  Server: 
    Inference count: 288
    Execution count: 18
    Successful request count: 18
    Avg request latency: 59491 usec (overhead 164 usec + queue 34 usec + compute input 3309 usec + compute infer 55956 usec + compute output 27 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 144 infer/sec, latency 109309 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 32 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
  Batch size: 32
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 13
    Throughput: 138.62 infer/sec
    Avg latency: 222231 usec (standard deviation 24533 usec)
    p50 latency: 210092 usec
    p90 latency: 267586 usec
    p95 latency: 267586 usec
    p99 latency: 287637 usec
    Avg gRPC time: 222213 usec ((un)marshal request/response 7342 usec + response wait 214871 usec)
  Server: 
    Inference count: 416
    Execution count: 13
    Successful request count: 13
    Avg request latency: 115993 usec (overhead 239 usec + queue 37 usec + compute input 6783 usec + compute infer 108889 usec + compute output 43 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 138.62 infer/sec, latency 222231 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 64 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
  Batch size: 64
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 11
    Throughput: 140.772 infer/sec
    Avg latency: 433568 usec (standard deviation 4913 usec)
    p50 latency: 433047 usec
    p90 latency: 440276 usec
    p95 latency: 442303 usec
    p99 latency: 442303 usec
    Avg gRPC time: 433547 usec ((un)marshal request/response 13015 usec + response wait 420532 usec)
  Server: 
    Inference count: 704
    Execution count: 11
    Successful request count: 11
    Avg request latency: 265554 usec (overhead 313 usec + queue 43 usec + compute input 48010 usec + compute infer 217111 usec + compute output 77 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 140.772 infer/sec, latency 433568 usec
root@d802522be317:/workspace# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 128 --measurement-mode count_windows --measurement-request-count=10 -u 10.3.12.37:8001
*** Measurement Settings ***
  Batch size: 128
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 9
    Throughput: 127.986 infer/sec
    Avg latency: 915560 usec (standard deviation 25010 usec)
    p50 latency: 912273 usec
    p90 latency: 930086 usec
    p95 latency: 969772 usec
    p99 latency: 969772 usec
    Avg gRPC time: 911290 usec ((un)marshal request/response 27866 usec + response wait 883424 usec)
  Server: 
    Inference count: 1280
    Execution count: 10
    Successful request count: 10
    Avg request latency: 521936 usec (overhead 364 usec + queue 47 usec + compute input 95572 usec + compute infer 425814 usec + compute output 139 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 127.986 infer/sec, latency 915560 usec
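Because the compose file above starts the server with --model-control-mode=explicit, the model also has to be loaded before benchmarking. A minimal sketch using Triton's model repository HTTP endpoint, assuming the port mapping from the compose file:

# ask the server to load the model (HTTP port 8000 mapped above)
curl -X POST http://localhost:8000/v2/repository/models/inception_graphdef/load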

Expected behavior
I expect the compute infer time in the perf_analyzer results to stay roughly constant as long as the batch size is smaller than the max batch size of 128.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
Kokkini commented, Jun 14, 2022

It turned out batching doesn't work the way I thought. Batching doesn't mean the GPU processes all images at exactly the same time. The GPU has fewer than 10k cores, and it's normal for a single layer to require on the order of 100M multiplications per image, so the GPU still has to break each layer into smaller chunks: it runs all the multiplications in one chunk in parallel, then moves on to the next. Adding images to a batch creates more chunks, which increases the time almost linearly. Triton works well; there's no problem with it. Thanks @tanmayv25 for helping me out.
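As a rough back-of-envelope sketch of that argument (the FLOP count, sustained throughput, and fixed overhead below are assumed ballpark figures, not measurements from this issue):

# Rough model: total work grows with the number of images, and once the
# GPU's ~2,560 CUDA cores are saturated, time grows with total work.
FLOPS_PER_IMAGE = 11.5e9     # approx. InceptionV3 forward pass (assumed)
SUSTAINED_FLOPS = 4.0e12     # assumed sustained FP32 rate on a T4
FIXED_OVERHEAD_S = 0.006     # assumed per-execution setup/launch cost

for batch in [1, 2, 4, 8, 16, 32, 64, 128]:
    compute_s = batch * FLOPS_PER_IMAGE / SUSTAINED_FLOPS
    print(f"batch {batch:3d}: ~{(FIXED_OVERHEAD_S + compute_s) * 1e3:6.0f} ms")

Only the small fixed term amortizes as the batch grows; the per-image work does not, so once the GPU is saturated the compute infer time is close to linear in batch size.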

0 reactions
Kokkini commented, Jun 10, 2022

Thank you @tanmayv25. I tried --shared-memory=cuda but the result was still the same; the output is below. I think the pure infer time (the matrix multiplications, excluding data movement) should scale sub-linearly, and that it should be significant enough to bend the curve toward sub-linear rather than linear. Do you know what else may be happening here?

root@ip-10-3-12-37:/home/ubuntu# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 1 --measurement-mode count_windows --measurement-request-count=10 --shared-memory=cuda
*** Measurement Settings ***
  Batch size: 1
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 105
    Throughput: 105 infer/sec
    Avg latency: 9469 usec (standard deviation 95 usec)
    p50 latency: 9473 usec
    p90 latency: 9580 usec
    p95 latency: 9599 usec
    p99 latency: 9828 usec
    Avg gRPC time: 9463 usec ((un)marshal request/response 5 usec + response wait 9458 usec)
  Server: 
    Inference count: 106
    Execution count: 106
    Successful request count: 106
    Avg request latency: 9281 usec (overhead 56 usec + queue 16 usec + compute input 290 usec + compute infer 8897 usec + compute output 21 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 105 infer/sec, latency 9469 usec
root@ip-10-3-12-37:/home/ubuntu# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 2 --measurement-mode count_windows --measurement-request-count=10 --shared-memory=cuda
*** Measurement Settings ***
  Batch size: 2
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 76
    Throughput: 152 infer/sec
    Avg latency: 13102 usec (standard deviation 121 usec)
    p50 latency: 13091 usec
    p90 latency: 13179 usec
    p95 latency: 13232 usec
    p99 latency: 13482 usec
    Avg gRPC time: 13096 usec ((un)marshal request/response 5 usec + response wait 13091 usec)
  Server: 
    Inference count: 152
    Execution count: 76
    Successful request count: 76
    Avg request latency: 12914 usec (overhead 60 usec + queue 16 usec + compute input 563 usec + compute infer 12251 usec + compute output 23 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 152 infer/sec, latency 13102 usec
root@ip-10-3-12-37:/home/ubuntu# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 4 --measurement-mode count_windows --measurement-request-count=10 --shared-memory=cuda
*** Measurement Settings ***
  Batch size: 4
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 49
    Throughput: 196 infer/sec
    Avg latency: 20204 usec (standard deviation 218 usec)
    p50 latency: 20117 usec
    p90 latency: 20541 usec
    p95 latency: 20680 usec
    p99 latency: 20843 usec
    Avg gRPC time: 20197 usec ((un)marshal request/response 3 usec + response wait 20194 usec)
  Server: 
    Inference count: 200
    Execution count: 50
    Successful request count: 50
    Avg request latency: 20040 usec (overhead 56 usec + queue 16 usec + compute input 1132 usec + compute infer 18810 usec + compute output 25 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 196 infer/sec, latency 20204 usec
root@ip-10-3-12-37:/home/ubuntu# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 8 --measurement-mode count_windows --measurement-request-count=10 --shared-memory=cuda
*** Measurement Settings ***
  Batch size: 8
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 29
    Throughput: 231.768 infer/sec
    Avg latency: 34281 usec (standard deviation 373 usec)
    p50 latency: 34237 usec
    p90 latency: 34741 usec
    p95 latency: 34946 usec
    p99 latency: 35024 usec
    Avg gRPC time: 34291 usec ((un)marshal request/response 5 usec + response wait 34286 usec)
  Server: 
    Inference count: 240
    Execution count: 30
    Successful request count: 30
    Avg request latency: 34101 usec (overhead 64 usec + queue 17 usec + compute input 2303 usec + compute infer 31683 usec + compute output 32 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 231.768 infer/sec, latency 34281 usec
root@ip-10-3-12-37:/home/ubuntu# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 16 --measurement-mode count_windows --measurement-request-count=10 --shared-memory=cuda
*** Measurement Settings ***
  Batch size: 16
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 16
    Throughput: 256 infer/sec
    Avg latency: 61010 usec (standard deviation 428 usec)
    p50 latency: 61037 usec
    p90 latency: 61515 usec
    p95 latency: 61515 usec
    p99 latency: 61840 usec
    Avg gRPC time: 61001 usec ((un)marshal request/response 6 usec + response wait 60995 usec)
  Server: 
    Inference count: 256
    Execution count: 16
    Successful request count: 16
    Avg request latency: 60758 usec (overhead 76 usec + queue 18 usec + compute input 4962 usec + compute infer 55656 usec + compute output 45 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 256 infer/sec, latency 61010 usec
root@ip-10-3-12-37:/home/ubuntu# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 32 --measurement-mode count_windows --measurement-request-count=10 --shared-memory=cuda
*** Measurement Settings ***
  Batch size: 32
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 16
    Throughput: 256 infer/sec
    Avg latency: 118428 usec (standard deviation 495 usec)
    p50 latency: 118594 usec
    p90 latency: 119104 usec
    p95 latency: 119104 usec
    p99 latency: 119112 usec
    Avg gRPC time: 118416 usec ((un)marshal request/response 6 usec + response wait 118410 usec)
  Server: 
    Inference count: 544
    Execution count: 17
    Successful request count: 17
    Avg request latency: 118167 usec (overhead 74 usec + queue 17 usec + compute input 11955 usec + compute infer 106057 usec + compute output 64 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 256 infer/sec, latency 118428 usec
root@ip-10-3-12-37:/home/ubuntu# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 64 --measurement-mode count_windows --measurement-request-count=10 --shared-memory=cuda
*** Measurement Settings ***
  Batch size: 64
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 11
    Throughput: 234.667 infer/sec
    Avg latency: 271585 usec (standard deviation 820 usec)
    p50 latency: 271903 usec
    p90 latency: 272219 usec
    p95 latency: 272539 usec
    p99 latency: 272539 usec
    Avg gRPC time: 271573 usec ((un)marshal request/response 8 usec + response wait 271565 usec)
  Server: 
    Inference count: 704
    Execution count: 11
    Successful request count: 11
    Avg request latency: 271315 usec (overhead 101 usec + queue 19 usec + compute input 59378 usec + compute infer 211698 usec + compute output 118 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 234.667 infer/sec, latency 271585 usec
root@ip-10-3-12-37:/home/ubuntu# perf_analyzer -m inception_graphdef --service-kind triton -i grpc -b 128 --measurement-mode count_windows --measurement-request-count=10 --shared-memory=cuda
*** Measurement Settings ***
  Batch size: 128
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 9
    Throughput: 230.354 infer/sec
    Avg latency: 527078 usec (standard deviation 914 usec)
    p50 latency: 527016 usec
    p90 latency: 528407 usec
    p95 latency: 528620 usec
    p99 latency: 528620 usec
    Avg gRPC time: 527000 usec ((un)marshal request/response 9 usec + response wait 526991 usec)
  Server: 
    Inference count: 1280
    Execution count: 10
    Successful request count: 10
    Avg request latency: 526711 usec (overhead 121 usec + queue 19 usec + compute input 118358 usec + compute infer 407990 usec + compute output 222 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 230.354 infer/sec, latency 527078 usec
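One way to quantify the trend in the numbers above is a least-squares fit of compute infer time against batch size. A minimal sketch using the values reported in the --shared-memory=cuda run (assumes numpy is installed):

import numpy as np

# compute infer time (usec) vs batch size, copied from the
# perf_analyzer output in this comment
batch = np.array([1, 2, 4, 8, 16, 32, 64, 128])
compute_infer_us = np.array(
    [8897, 12251, 18810, 31683, 55656, 106057, 211698, 407990])

# fit: compute_infer_us ~ slope * batch + intercept
slope, intercept = np.polyfit(batch, compute_infer_us, 1)
print(f"per-image cost : ~{slope / 1e3:.2f} ms")
print(f"fixed overhead : ~{intercept / 1e3:.2f} ms")

The fit works out to roughly 3 ms of compute per image on top of only a few milliseconds of fixed per-execution cost, i.e. there is very little constant work left to amortize, which is consistent with the explanation in the comment above.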