Ensemble model throughput lower than member models
Description
I have an ensemble model C := [A (GPU), B (CPU)]. The ensemble model's throughput (queries per second) is significantly lower than that of either member model. Please provide suggestions for improvement.
Triton Information
What version of Triton are you using?
commit 8ecd15d31e028c69a611c227d57d909d04bdfa22 (HEAD -> master, origin/master, origin/HEAD)
Author: Iman Tabrizian <itabrizian@nvidia.com>
Date: Thu Apr 8 18:05:34 2021 -0400
To Reproduce
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble, include the model configuration file for that as well).
Model A (sfpn16) - a semantic segmentation model; given an uncompressed image, it outputs the segmentation mask.
platform: "pytorch_libtorch"
default_model_filename: "model.pt"
max_batch_size: 32
input {
name: "INPUT__0"
data_type: TYPE_FP16
dims: [3, 256, 448]
}
output {
name: "OUTPUT__0"
data_type: TYPE_FP16
dims: [2, 256, 448]
}
dynamic_batching {
max_queue_delay_microseconds: 150000
}
instance_group [ { count: 2 }]
Model B (pypost16) - given a raw image, encodes it to a JPEG byte buffer (see the sketch after the config below).
backend: "python"
max_batch_size: 0
input [
{
name: "INPUT"
data_type: TYPE_FP16
dims: [-1, 256, 448]
}
]
output [
{
name: "OUTPUT"
data_type: TYPE_STRING
dims: [1]
}
]
instance_group [ {
count: 1
kind: KIND_CPU
}
]
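The model.py for pypost16 is not included in the issue. A minimal sketch of what a JPEG-encoding python_backend model might look like (the mask-to-image conversion via argmax is a guess; the pb_utils calls are the standard python_backend API):

```python
import numpy as np
import cv2
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # FP16 mask of shape [C, 256, 448] produced by the upstream model
            mask = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()
            # Guess: collapse channels to a class map and scale to 8-bit for encoding
            img = (mask.argmax(axis=0) * 255).astype(np.uint8)
            _, buf = cv2.imencode(".jpg", img)
            # TYPE_STRING outputs are numpy object arrays of bytes
            out = pb_utils.Tensor("OUTPUT", np.array([buf.tobytes()], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```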
Model C (sfpn16_pypost16)
platform: "ensemble"
max_batch_size: 0
input [
{
name: "INPUT"
data_type: TYPE_FP16
dims: [3, 256, 448]
}
]
output {
name: "OUTPUT"
data_type: TYPE_STRING
dims: [ 1 ]
}
ensemble_scheduling {
step [
{
model_name: "sfpn16"
model_version: -1
input_map {
key: "INPUT__0"
value: "INPUT"
}
output_map {
key: "OUTPUT__0"
value: "raw_mask"
}
},
{
model_name: "pypost16"
model_version: -1
input_map {
key: "INPUT"
value: "raw_mask"
}
output_map {
key: "OUTPUT"
value: "OUTPUT"
}
}
]
}
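For reference, a single request against the ensemble can be issued with the Triton Python client; this is a sketch assuming the stock tritonclient package, not the client used in the benchmarks below:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Random FP16 image matching the ensemble's declared input (no batch dim, max_batch_size: 0)
image = np.random.rand(3, 256, 448).astype(np.float16)

inp = httpclient.InferInput("INPUT", list(image.shape), "FP16")
inp.set_data_from_numpy(image, binary_data=True)
out = httpclient.InferRequestedOutput("OUTPUT", binary_data=True)

result = client.infer("sfpn16_pypost16", inputs=[inp], outputs=[out])
jpeg_bytes = result.as_numpy("OUTPUT")[0]  # one serialized JPEG per request
```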
Bombardier test commands
Model A (sfpn16)
+ ./bombardier -c 128 -r 900 -t 1s -d 10s -m POST -H 'Inference-Header-Content-Length: 154' -f /tmp/postfile http://localhost:8000/v2/models/sfpn16/infer
Bombarding http://localhost:8000/v2/models/sfpn16/infer for 10s using 128 connection(s)
[==============================================================================================================] 10s
Done!
Statistics Avg Stdev Max
Reqs/sec 314.86 621.58 3209.88
Latency 382.76ms 42.71ms 515.15ms
HTTP codes:
1xx - 0, 2xx - 3277, 3xx - 0, 4xx - 0, 5xx - 0
others - 0
Throughput: 347.20MB/s
Model B (pypost16)
+ ./bombardier -c 128 -r 900 -t 1s -d 10s -m POST -H 'Inference-Header-Content-Length: 151' -f /tmp/postfile http://localhost:8000/v2/models/pypost16/infer
Bombarding http://localhost:8000/v2/models/pypost16/infer for 10s using 128 connection(s)
[==============================================================================================================] 10s
Done!
Statistics Avg Stdev Max
Reqs/sec 420.80 43.39 502.08
Latency 296.59ms 37.25ms 348.71ms
HTTP codes:
1xx - 0, 2xx - 4332, 3xx - 0, 4xx - 0, 5xx - 0
others - 0
Throughput: 276.81MB/s
Model C (sfpn16_pypost16)
+ ./bombardier -c 128 -r 900 -t 1s -d 10s -m POST -H 'Inference-Header-Content-Length: 149' -f /tmp/postfile http://localhost:8000/v2/models/sfpn16_pypost16/infer
Bombarding http://localhost:8000/v2/models/sfpn16_pypost16/infer for 10s using 128 connection(s)
[==============================================================================================================] 10s
Done!
Statistics Avg Stdev Max
Reqs/sec 141.40 106.35 399.37
Latency 844.28ms 213.55ms 2.90s
HTTP codes:
1xx - 0, 2xx - 1538, 3xx - 0, 4xx - 0, 5xx - 0
others - 0
Throughput: 99.48MB/s
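The issue does not show how /tmp/postfile was generated. Assuming Triton's KServe v2 binary tensor extension, it would have been produced along these lines (sfpn16 variant shown; the ensemble's payload would drop the leading batch dimension and use input name "INPUT", which is why the three Inference-Header-Content-Length values differ slightly):

```python
import json
import numpy as np

# FP16 image matching sfpn16's input, with a batch dimension of 1
image = np.random.rand(1, 3, 256, 448).astype(np.float16)
data = image.tobytes()

header = json.dumps({
    "inputs": [{
        "name": "INPUT__0",
        "shape": list(image.shape),
        "datatype": "FP16",
        "parameters": {"binary_data_size": len(data)},
    }]
})

with open("/tmp/postfile", "wb") as f:
    f.write(header.encode() + data)

# This length goes into the Inference-Header-Content-Length header
print(len(header))
```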
Triton Log (verbose=2)
I0423 01:48:29.729674 4024 http_server.cc:1229] HTTP request: 2 /v2/models/sfpn16_pypost16/infer
I0423 01:48:29.729811 4024 model_repository_manager.cc:659] GetInferenceBackend() 'sfpn16_pypost16' version -1
I0423 01:48:29.729869 4024 model_repository_manager.cc:659] GetInferenceBackend() 'sfpn16_pypost16' version -1
I0423 01:48:29.729956 4024 infer_request.cc:497] prepared: [0x0x7f23299105a0] request id: , model: sfpn16_pypost16, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f2329304518] input: INPUT, type: FP16, original shape: [3,256,448], batch + shape: [3,256,448], shape: [3,256,448]
override inputs:
inputs:
[0x0x7f2329304518] input: INPUT, type: FP16, original shape: [3,256,448], batch + shape: [3,256,448], shape: [3,256,448]
original requested outputs:
requested outputs:
OUTPUT
I0423 01:48:29.730021 4024 model_repository_manager.cc:659] GetInferenceBackend() 'sfpn16' version -1
I0423 01:48:29.730046 4024 model_repository_manager.cc:659] GetInferenceBackend() 'pypost16' version -1
I0423 01:48:29.730109 4024 infer_request.cc:497] prepared: [0x0x7f23290ff490] request id: , model: sfpn16, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f2329302d98] input: INPUT__0, type: FP16, original shape: [1,3,256,448], batch + shape: [1,3,256,448], shape: [3,256,448]
override inputs:
inputs:
[0x0x7f2329302d98] input: INPUT__0, type: FP16, original shape: [1,3,256,448], batch + shape: [1,3,256,448], shape: [3,256,448]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0
I0423 01:48:29.880404 4024 libtorch.cc:1087] model sfpn16, instance sfpn16_0_1, executing 1 requests
I0423 01:48:29.880465 4024 libtorch.cc:504] TRITONBACKEND_ModelExecute: Running sfpn16_0_1 with 1 requests
I0423 01:48:29.916852 4024 infer_response.cc:165] add response output: output: OUTPUT__0, type: FP16, shape: [1,2,256,448]
I0423 01:48:29.916892 4024 ensemble_scheduler.cc:509] Internal response allocation: OUTPUT__0, size 458752, addr 0x7f24520a8000, memory type 2, type id 0
I0423 01:48:29.917185 4024 ensemble_scheduler.cc:524] Internal response release: size 458752, addr 0x7f24520a8000
I0423 01:48:29.917229 4024 infer_request.cc:497] prepared: [0x0x7f231c65d370] request id: , model: pypost16, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f2370e08958] input: INPUT, type: FP16, original shape: [2,256,448], batch + shape: [2,256,448], shape: [2,256,448]
override inputs:
inputs:
[0x0x7f2370e08958] input: INPUT, type: FP16, original shape: [2,256,448], batch + shape: [2,256,448], shape: [2,256,448]
original requested outputs:
OUTPUT
requested outputs:
OUTPUT
I0423 01:48:29.917366 4024 python.cc:354] model pypost16, instance pypost16_0, executing 1 requests
I0423 01:48:29.920917 4024 infer_response.cc:165] add response output: output: OUTPUT, type: UINT8, shape: [1344]
I0423 01:48:29.920983 4024 pinned_memory_manager.cc:131] pinned memory allocation: size 1344, addr 0x7f2462000090
I0423 01:48:29.921001 4024 ensemble_scheduler.cc:509] Internal response allocation: OUTPUT, size 1344, addr 0x7f2462000090, memory type 1, type id 0
I0423 01:48:29.921036 4024 ensemble_scheduler.cc:524] Internal response release: size 1344, addr 0x7f2462000090
I0423 01:48:29.921058 4024 infer_response.cc:139] add response output: output: OUTPUT, type: UINT8, shape: [1344]
I0423 01:48:29.921081 4024 http_server.cc:1180] HTTP: unable to provide 'OUTPUT' in CPU_PINNED, will use CPU
I0423 01:48:29.921121 4024 http_server.cc:1200] HTTP using buffer for: 'OUTPUT', size: 1344, addr: 0x7f2418007580
I0423 01:48:29.921162 4024 pinned_memory_manager.cc:158] pinned memory deallocation: addr 0x7f2462000090
I0423 01:48:29.921277 4024 http_server.cc:1215] HTTP release: size 1344, addr 0x7f2418007580
I0423 01:48:29.921352 4024 python.cc:695] TRITONBACKEND_ModelInstanceExecute: model instance name pypost16_0 released 1 requests
Expected behavior
Model A is GPU compute bound (GPU utilization > 95%) and model B is CPU bound. I expect the queries-per-second of model C to be roughly the smaller of the two member models' rates, i.e. min(314, 420) = 314 qps, but in reality I obtain 141 qps.
Ablation studies
- switching model B (pypost16) to an identity python_backend model does not help
- reducing A (sfpn16) to max_batch_size=1 results in (A/sfpn16: 145 qps, C/sfpn16_pypost16: 147 qps)
- reducing A (sfpn16) to max_batch_size=4 results in (A/sfpn16: 230 qps, C/sfpn16_pypost16: 191 qps)
Top GitHub Comments
The DtoH transfer from the pytorch model (sfpn16) to the python model (pypost16) was slower than pypost16 could consume; the async transfer did not happen as hoped/expected. Possible causes:
(1) The DtoH transfer (red bars) shares the same stream_id with the GPU compute kernel launches, whereas the HtoD transfer for the HTTP request is done on a separate stream (green bars). In comparison, the bare sfpn16 model (no ensemble) keeps DtoH and HtoD on a stream separate from the compute kernel launch stream.
(2) Host memory is not pinned (observed in Nsight as pageable) even with the following configuration added. This happens for both sfpn16 and sfpn16_pypost16.
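The pinned-memory configuration referenced above is not shown. Independent of the model config, Triton sizes its pinned host buffer pool at server startup; as an assumption about one knob worth checking (not the reporter's actual command):

```bash
tritonserver --model-repository=/models \
             --pinned-memory-pool-byte-size=1073741824  # 1 GiB pinned pool
```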
@deadeyegoodwin is there a design doc/example specifying the dataflow (memory type, who initiates the copy, etc.)?
One thing to note is that almost all the latency in all cases is from the request sitting in the queue. How did you choose a concurrency of 128 to measure at? Did you look at the entire latency-vs-throughput curve? You might also need to use a profiler (Nsight Systems) to try to understand whether there are unexpected bottlenecks in the execution of the ensemble.
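A sketch of the kind of sweep suggested here, using Triton's perf_analyzer (the concurrency range is illustrative):

```bash
# Sweep concurrency 1..128 and record latency vs. throughput at each step
perf_analyzer -m sfpn16_pypost16 -u localhost:8000 \
              --concurrency-range 1:128:8
```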