Multi-instance performance is slightly lower than expected [pytorch_backend]
Hi.
I found that multi-instance performance is slightly lower than I expected. My environment and experiment results are below.
Machine Information
GPU: TITAN RTX × 3
OS: Ubuntu 18.04
Triton with pytorch_backend: r22.08
Model config
name: "resnet101"
platform: "pytorch_libtorch"
max_batch_size: 32
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]  # changed to [ 0, 1 ] for the multi-instance runs
}
]
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [3, 224, 224]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [1000]
label_filename: "resnet_labels.txt"
}
]
version_policy: { specific: {versions: [1]}}
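For clarity, the two-GPU runs only change the gpus field; a sketch of that instance_group is below (the rest of the config file is unchanged). Since count is applied per GPU listed under gpus for KIND_GPU, count: 1 with two GPUs gives two model instances in total.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]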
perf_analyzer results (single instance, gpus: [ 0 ])
*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 30000 msec
Latency limit: 0 msec
Request Rate limit: 200 requests per seconds
Using uniform distribution on request generation
Using asynchronous calls for inference
Stabilizing using average latency
Request Rate: 200 inference requests per seconds
Client:
Request count: 3245
Delayed Request Count: 537
Throughput: 90.1381 infer/sec
Avg latency: 9909994 usec (standard deviation 10169 usec)
p50 latency: 9906561 usec
p90 latency: 17777911 usec
p95 latency: 18744365 usec
p99 latency: 19588592 usec
Avg HTTP time: 9908353 usec (send/recv 2162 usec + response wait 9906191 usec)
Server:
Inference count: 3245
Execution count: 3245
Successful request count: 3245
Avg request latency: 9893671 usec (overhead 27 usec + queue 9882592 usec + compute input 191 usec + compute infer 10823 usec + compute output 37 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 200, throughput: 90.1381 infer/sec, latency 9909994 usec
perf_analyzer results (multi-instance, gpus: [ 0, 1 ])
*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 30000 msec
Latency limit: 0 msec
Request Rate limit: 200 requests per seconds
Using uniform distribution on request generation
Using asynchronous calls for inference
Stabilizing using average latency
Request Rate: 200 inference requests per seconds
Client:
Request count: 6230
Delayed Request Count: 5593
Throughput: 173.054 infer/sec
Avg latency: 1381404 usec (standard deviation 19187 usec)
p50 latency: 1344601 usec
p90 latency: 2542631 usec
p95 latency: 2859819 usec
p99 latency: 3037915 usec
Avg HTTP time: 1381659 usec (send/recv 536 usec + response wait 1381123 usec)
Server:
Inference count: 6231
Execution count: 6231
Successful request count: 6231
Avg request latency: 1359799 usec (overhead 26 usec + queue 1348261 usec + compute input 172 usec + compute infer 11301 usec + compute output 39 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 200, throughput: 173.054 infer/sec, latency 1381404 usec
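For reference, both runs were driven with a perf_analyzer command along these lines (a reconstruction from the measurement settings printed above; the exact flags may have differed):
perf_analyzer -m resnet101 -b 1 --async \
    --request-rate-range=200 \
    --measurement-interval=30000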
The two-instance throughput is close to double the single-instance number, but I expected it to scale linearly: 90.1 × 2 ≈ 180 infer/sec, whereas I measure 173. With three instances, perf_analyzer reports 253.442 infer/sec versus an expected 90.1 × 3 ≈ 270. In other words, each instance added on another GPU contributes about 8~10 infer/sec less than a single instance delivers on its own.
Why does this happen?
Is it contention on the request queue between the two instances, or some other contention that I am not aware of?
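In case it helps narrow this down, per-GPU utilization during the runs can be watched with nvidia-smi, for example:
nvidia-smi dmon -s u   # prints per-GPU SM/memory utilization, sampled once per second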
Thanks for reading.
Top GitHub Comments
Please also note that if you are running perf analyzer on the same machine as Triton, then they are competing for CPU resources. That can impact your results, even for models running on the GPU.
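For example, running the client side from the Triton SDK container on a different host avoids that contention (a sketch; the hostname is a placeholder, and the other flags should match your earlier runs):
docker run --rm --net=host nvcr.io/nvidia/tritonserver:22.08-py3-sdk \
    perf_analyzer -m resnet101 -b 1 --async \
        --request-rate-range=200 -u <triton-host>:8000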
As for “how many instances should I use?”, that is a question best answered by Model Analyzer. It is a wrapper around Perf Analyzer that will run multiple experiments to find the best configuration under any constraints you give it (such as a maximum latency budget).
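A minimal Model Analyzer invocation looks something like this (paths are placeholders; constraints such as a latency budget are supplied through its configuration options):
model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models resnet101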
Understood. They don’t share GPU resources but they do share the network for receiving and responding to requests. I suspect @dyastremsky was on the right track with this statement: