Ensemble model throughput lower than member models
Description
I have an ensemble model C := [A (GPU), B (CPU)]. The ensemble model's throughput (queries per second) is significantly lower than that of either member model. Please provide suggestions for improvement.
Triton Information
What version of Triton are you using?
commit 8ecd15d31e028c69a611c227d57d909d04bdfa22 (HEAD -> master, origin/master, origin/HEAD)
Author: Iman Tabrizian <itabrizian@nvidia.com>
Date: Thu Apr 8 18:05:34 2021 -0400
To Reproduce
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble, include the model configuration file for that as well).
Model A (sfpn16) - a semantic segmentation model; given an uncompressed image, it outputs the segmentation mask.
platform: "pytorch_libtorch"
default_model_filename: "model.pt"
max_batch_size: 32
input {
name: "INPUT__0"
data_type: TYPE_FP16
dims: [3, 256, 448]
}
output {
name: "OUTPUT__0"
data_type: TYPE_FP16
dims: [2, 256, 448]
}
dynamic_batching {
max_queue_delay_microseconds: 150000
}
instance_group [ { count: 2 }]
Model B (pypost16) - given a raw image, encodes it to a JPEG byte buffer (see the sketch after the config below).
backend: "python"
max_batch_size: 0
input [
{
name: "INPUT"
data_type: TYPE_FP16
dims: [-1, 256, 448]
}
]
output [
{
name: "OUTPUT"
data_type: TYPE_STRING
dims: [1]
}
]
instance_group [ {
count: 1
kind: KIND_CPU
}
]
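The model.py for pypost16 is not included in the issue. A minimal sketch of what a JPEG-encoding python_backend model might look like (the mask-to-image conversion via argmax is a guess; the pb_utils calls are the standard python_backend API):

```python
import numpy as np
import cv2
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # FP16 mask of shape [C, 256, 448] produced by the upstream model
            mask = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()
            # Guess: collapse channels to a class map and scale to 8-bit for encoding
            img = (mask.argmax(axis=0) * 255).astype(np.uint8)
            _, buf = cv2.imencode(".jpg", img)
            # TYPE_STRING outputs are numpy object arrays of bytes
            out = pb_utils.Tensor("OUTPUT", np.array([buf.tobytes()], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```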
Model C (sfpn16_pypost16)
platform: "ensemble"
max_batch_size: 0
input [
{
name: "INPUT"
data_type: TYPE_FP16
dims: [3, 256, 448]
}
]
output {
name: "OUTPUT"
data_type: TYPE_STRING
dims: [ 1 ]
}
ensemble_scheduling {
step [
{
model_name: "sfpn16"
model_version: -1
input_map {
key: "INPUT__0"
value: "INPUT"
}
output_map {
key: "OUTPUT__0"
value: "raw_mask"
}
},
{
model_name: "pypost16"
model_version: -1
input_map {
key: "INPUT"
value: "raw_mask"
}
output_map {
key: "OUTPUT"
value: "OUTPUT"
}
}
]
}
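For reference, a single request against the ensemble can be issued with the Triton Python client; this is a sketch assuming the stock tritonclient package, not the client used in the benchmarks below:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Random FP16 image matching the ensemble's declared input (no batch dim, max_batch_size: 0)
image = np.random.rand(3, 256, 448).astype(np.float16)

inp = httpclient.InferInput("INPUT", list(image.shape), "FP16")
inp.set_data_from_numpy(image, binary_data=True)
out = httpclient.InferRequestedOutput("OUTPUT", binary_data=True)

result = client.infer("sfpn16_pypost16", inputs=[inp], outputs=[out])
jpeg_bytes = result.as_numpy("OUTPUT")[0]  # one serialized JPEG per request
```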
Bombardier test commands
Model A (sfpn16)
+ ./bombardier -c 128 -r 900 -t 1s -d 10s -m POST -H 'Inference-Header-Content-Length: 154' -f /tmp/postfile http://localhost:8000/v2/models/sfpn16/infer
Bombarding http://localhost:8000/v2/models/sfpn16/infer for 10s using 128 connection(s)
[==============================================================================================================] 10s
Done!
Statistics Avg Stdev Max
Reqs/sec 314.86 621.58 3209.88
Latency 382.76ms 42.71ms 515.15ms
HTTP codes:
1xx - 0, 2xx - 3277, 3xx - 0, 4xx - 0, 5xx - 0
others - 0
Throughput: 347.20MB/s
Model B (pypost16)
+ ./bombardier -c 128 -r 900 -t 1s -d 10s -m POST -H 'Inference-Header-Content-Length: 151' -f /tmp/postfile http://localhost:8000/v2/models/pypost16/infer
Bombarding http://localhost:8000/v2/models/pypost16/infer for 10s using 128 connection(s)
[==============================================================================================================] 10s
Done!
Statistics Avg Stdev Max
Reqs/sec 420.80 43.39 502.08
Latency 296.59ms 37.25ms 348.71ms
HTTP codes:
1xx - 0, 2xx - 4332, 3xx - 0, 4xx - 0, 5xx - 0
others - 0
Throughput: 276.81MB/s
Model C (sfpn16_pypost16)
+ ./bombardier -c 128 -r 900 -t 1s -d 10s -m POST -H 'Inference-Header-Content-Length: 149' -f /tmp/postfile http://localhost:8000/v2/models/sfpn16_pypost16/infer
Bombarding http://localhost:8000/v2/models/sfpn16_pypost16/infer for 10s using 128 connection(s)
[==============================================================================================================] 10s
Done!
Statistics Avg Stdev Max
Reqs/sec 141.40 106.35 399.37
Latency 844.28ms 213.55ms 2.90s
HTTP codes:
1xx - 0, 2xx - 1538, 3xx - 0, 4xx - 0, 5xx - 0
others - 0
Throughput: 99.48MB/s
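The issue does not show how /tmp/postfile was generated. Assuming Triton's KServe v2 binary tensor extension, it would have been produced along these lines (sfpn16 variant shown; the ensemble's payload would drop the leading batch dimension and use input name "INPUT", which is why the three Inference-Header-Content-Length values differ slightly):

```python
import json
import numpy as np

# FP16 image matching sfpn16's input, with a batch dimension of 1
image = np.random.rand(1, 3, 256, 448).astype(np.float16)
data = image.tobytes()

header = json.dumps({
    "inputs": [{
        "name": "INPUT__0",
        "shape": list(image.shape),
        "datatype": "FP16",
        "parameters": {"binary_data_size": len(data)},
    }]
})

with open("/tmp/postfile", "wb") as f:
    f.write(header.encode() + data)

# This length goes into the Inference-Header-Content-Length header
print(len(header))
```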
Triton Log (verbose=2)
I0423 01:48:29.729674 4024 http_server.cc:1229] HTTP request: 2 /v2/models/sfpn16_pypost16/infer
I0423 01:48:29.729811 4024 model_repository_manager.cc:659] GetInferenceBackend() 'sfpn16_pypost16' version -1
I0423 01:48:29.729869 4024 model_repository_manager.cc:659] GetInferenceBackend() 'sfpn16_pypost16' version -1
I0423 01:48:29.729956 4024 infer_request.cc:497] prepared: [0x0x7f23299105a0] request id: , model: sfpn16_pypost16, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f2329304518] input: INPUT, type: FP16, original shape: [3,256,448], batch + shape: [3,256,448], shape: [3,256,448]
override inputs:
inputs:
[0x0x7f2329304518] input: INPUT, type: FP16, original shape: [3,256,448], batch + shape: [3,256,448], shape: [3,256,448]
original requested outputs:
requested outputs:
OUTPUT
I0423 01:48:29.730021 4024 model_repository_manager.cc:659] GetInferenceBackend() 'sfpn16' version -1
I0423 01:48:29.730046 4024 model_repository_manager.cc:659] GetInferenceBackend() 'pypost16' version -1
I0423 01:48:29.730109 4024 infer_request.cc:497] prepared: [0x0x7f23290ff490] request id: , model: sfpn16, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f2329302d98] input: INPUT__0, type: FP16, original shape: [1,3,256,448], batch + shape: [1,3,256,448], shape: [3,256,448]
override inputs:
inputs:
[0x0x7f2329302d98] input: INPUT__0, type: FP16, original shape: [1,3,256,448], batch + shape: [1,3,256,448], shape: [3,256,448]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0
I0423 01:48:29.880404 4024 libtorch.cc:1087] model sfpn16, instance sfpn16_0_1, executing 1 requests
I0423 01:48:29.880465 4024 libtorch.cc:504] TRITONBACKEND_ModelExecute: Running sfpn16_0_1 with 1 requests
I0423 01:48:29.916852 4024 infer_response.cc:165] add response output: output: OUTPUT__0, type: FP16, shape: [1,2,256,448]
I0423 01:48:29.916892 4024 ensemble_scheduler.cc:509] Internal response allocation: OUTPUT__0, size 458752, addr 0x7f24520a8000, memory type 2, type id 0
I0423 01:48:29.917185 4024 ensemble_scheduler.cc:524] Internal response release: size 458752, addr 0x7f24520a8000
I0423 01:48:29.917229 4024 infer_request.cc:497] prepared: [0x0x7f231c65d370] request id: , model: pypost16, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f2370e08958] input: INPUT, type: FP16, original shape: [2,256,448], batch + shape: [2,256,448], shape: [2,256,448]
override inputs:
inputs:
[0x0x7f2370e08958] input: INPUT, type: FP16, original shape: [2,256,448], batch + shape: [2,256,448], shape: [2,256,448]
original requested outputs:
OUTPUT
requested outputs:
OUTPUT
I0423 01:48:29.917366 4024 python.cc:354] model pypost16, instance pypost16_0, executing 1 requests
I0423 01:48:29.920917 4024 infer_response.cc:165] add response output: output: OUTPUT, type: UINT8, shape: [1344]
I0423 01:48:29.920983 4024 pinned_memory_manager.cc:131] pinned memory allocation: size 1344, addr 0x7f2462000090
I0423 01:48:29.921001 4024 ensemble_scheduler.cc:509] Internal response allocation: OUTPUT, size 1344, addr 0x7f2462000090, memory type 1, type id 0
I0423 01:48:29.921036 4024 ensemble_scheduler.cc:524] Internal response release: size 1344, addr 0x7f2462000090
I0423 01:48:29.921058 4024 infer_response.cc:139] add response output: output: OUTPUT, type: UINT8, shape: [1344]
I0423 01:48:29.921081 4024 http_server.cc:1180] HTTP: unable to provide 'OUTPUT' in CPU_PINNED, will use CPU
I0423 01:48:29.921121 4024 http_server.cc:1200] HTTP using buffer for: 'OUTPUT', size: 1344, addr: 0x7f2418007580
I0423 01:48:29.921162 4024 pinned_memory_manager.cc:158] pinned memory deallocation: addr 0x7f2462000090
I0423 01:48:29.921277 4024 http_server.cc:1215] HTTP release: size 1344, addr 0x7f2418007580
I0423 01:48:29.921352 4024 python.cc:695] TRITONBACKEND_ModelInstanceExecute: model instance name pypost16_0 released 1 requests
Expected behavior
Model A is GPU compute bound (GPU utilization > 95%) and model B is CPU bound. I expect the queries-per-second of model C to be roughly the smaller of the two member models' rates, i.e. min(314, 420) = 314 qps, but in reality I obtain 141 qps.
Ablation studies
- switching model B (pypost16) to an identity python_backend model does not help
- reducing A (sfpn16) to max_batch_size=1 results in (A/sfpn16: 145 qps, C/sfpn16_pypost16: 147 qps)
- reducing A (sfpn16) to max_batch_size=4 results in (A/sfpn16: 230 qps, C/sfpn16_pypost16: 191 qps)
Top GitHub Comments
The DtoH transfer from the pytorch model (sfpn16) to the python model (pypost16) was slower than pypost16 could consume; the async transfer did not happen as hoped/expected. Possible causes:
(1) The DtoH transfer (red bars) shares the same stream_id with the GPU compute kernel launches, whereas the HtoD transfer for the HTTP request is done on a separate stream (green bars). In comparison, the bare sfpn16 model (no ensemble) keeps DtoH and HtoD on a stream separate from the compute kernel launch stream.
(2) Host memory is not pinned (observed in Nsight as pageable) even with the following configuration added. This happens for both sfpn16 and sfpn16_pypost16.
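The pinned-memory configuration referenced above is not shown. Independent of the model config, Triton sizes its pinned host buffer pool at server startup; as an assumption about one knob worth checking (not the reporter's actual command):

```bash
tritonserver --model-repository=/models \
             --pinned-memory-pool-byte-size=1073741824  # 1 GiB pinned pool
```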
@deadeyegoodwin is there a design doc/example specifying the dataflow (memory type, who initiates the copy, etc.)?
One thing to note is that almost all the latency in all cases is from the request sitting in the queue. How did you choose a concurrency of 128 to measure at? Did you look at the entire latency-vs-throughput curve? You might also need to use a profiler (Nsight Systems) to try to understand whether there are unexpected bottlenecks in the execution of the ensemble.
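A sketch of the kind of sweep suggested here, using Triton's perf_analyzer (the concurrency range is illustrative):

```bash
# Sweep concurrency 1..128 and record latency vs. throughput at each step
perf_analyzer -m sfpn16_pypost16 -u localhost:8000 \
              --concurrency-range 1:128:8
```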