TensorFlow models don't seem to batch properly
Description
TensorFlow models downloaded from the TFOD model zoo load and work just fine, but dynamic batching doesn’t seem to work. TF2 models report “model signature does not support batching”. TF1 models load with dynamic_batching enabled, but latency scales linearly with concurrency > 1.
Triton Information
What version of Triton are you using? 20.09
Are you using the Triton container or did you build it yourself? Container
To Reproduce
Download the TF1 resnet50 faster RCNN from here.
Download the TF2 resnet101 faster RCNN from here.
Load the models using --strict-model-config false.
Provide a minimal config.pbtxt enabling dynamic batching as below:
platform: "tensorflow_savedmodel"
max_batch_size: 2
dynamic_batching { }
Use perf_client to evaluate the model over concurrency from 1 to 8 as below:
perf_client -m fasterrcnn101v1640x640 --percentile=95 --shape input_tensor:1,640,640,3 -i gRPC --concurrency-range 1:8
With --log-verbose=1, the TF1 model shows the following:
"name": "fasterrcnn50_coco_2018_01_28",
"platform": "tensorflow_savedmodel",
"backend": "tensorflow",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 1,
"input": [
{
"name": "inputs",
"data_type": "TYPE_UINT8",
"dims": [
-1,
-1,
3
]
}
],
"output": [
{
"name": "detection_scores",
"data_type": "TYPE_FP32",
"dims": [
100
]
},
{
"name": "detection_boxes",
"data_type": "TYPE_FP32",
"dims": [
100,
4
]
},
{
"name": "num_detections",
"data_type": "TYPE_FP32",
"reshape": {
"shape": []
},
"dims": [
1
]
},
{
"name": "detection_classes",
"data_type": "TYPE_FP32",
"dims": [
100
]
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
}
},
"instance_group": [
{
"name": "fasterrcnn50_coco_2018_01_28",
"kind": "KIND_GPU",
"count": 1,
"gpus": [
0
],
"profile": []
}
],
"default_model_filename": "model.savedmodel",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {},
"model_warmup": []
}
The TF2 model shows the following:
"name": "fasterrcnn101v1640x640",
"platform": "tensorflow_savedmodel",
"backend": "tensorflow",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 0,
"input": [
{
"name": "input_tensor",
"data_type": "TYPE_UINT8",
"dims": [
1,
-1,
-1,
3
]
}
],
"output": [
{
"name": "detection_scores",
"data_type": "TYPE_FP32",
"dims": [
1,
300
]
},
{
"name": "raw_detection_boxes",
"data_type": "TYPE_FP32",
"dims": [
1,
300,
4
]
},
{
"name": "detection_boxes",
"data_type": "TYPE_FP32",
"dims": [
1,
300,
4
]
},
{
"name": "num_detections",
"data_type": "TYPE_FP32",
"dims": [
1
]
},
{
"name": "detection_classes",
"data_type": "TYPE_FP32",
"dims": [
1,
300
]
},
{
"name": "detection_multiclass_scores",
"data_type": "TYPE_FP32",
"dims": [
1,
300,
91
]
},
{
"name": "detection_anchor_indices",
"data_type": "TYPE_FP32",
"dims": [
1,
300
]
},
{
"name": "raw_detection_scores",
"data_type": "TYPE_FP32",
"dims": [
1,
300,
91
]
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
}
},
"instance_group": [
{
"name": "fasterrcnn101v1640x640",
"kind": "KIND_GPU",
"count": 1,
"gpus": [
0
],
"profile": []
}
],
"default_model_filename": "model.savedmodel",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {},
"model_warmup": []
}
The TF2 model then fails to load with the following message:
E1007 05:32:54.981619 1 model_repository_manager.cc:899] failed to load 'fasterrcnn101v1640x640' version 1: Internal: unable to autofill for 'fasterrcnn101v1640x640', configuration specified max-batch 2 but model signature does not support batching
Expected behavior
I expect throughput to increase with increased concurrency. Instead, throughput remains constant and latency scales linearly with concurrency. See the summary results from perf_client below:
Concurrency: 1, throughput: 18.3 infer/sec, latency 62018 usec
Concurrency: 2, throughput: 18.5 infer/sec, latency 127477 usec
Concurrency: 3, throughput: 18.4 infer/sec, latency 190984 usec
Concurrency: 4, throughput: 18.6 infer/sec, latency 236294 usec
Concurrency: 5, throughput: 18.6 infer/sec, latency 292782 usec
Concurrency: 6, throughput: 17.8 infer/sec, latency 362044 usec
Concurrency: 7, throughput: 18.1 infer/sec, latency 426605 usec
Concurrency: 8, throughput: 18.2 infer/sec, latency 469077 usec
For comparison, an ONNX Yolov4 model gets the following results after optimization and with dynamic batching enabled:
Concurrency: 1, throughput: 64.5 infer/sec, latency 17382 usec
Concurrency: 2, throughput: 82.3 infer/sec, latency 29741 usec
Concurrency: 3, throughput: 86.3 infer/sec, latency 45026 usec
Concurrency: 4, throughput: 86 infer/sec, latency 65735 usec
Concurrency: 5, throughput: 101.8 infer/sec, latency 70180 usec
Concurrency: 6, throughput: 115.9 infer/sec, latency 73805 usec
Concurrency: 7, throughput: 128.5 infer/sec, latency 78493 usec
Concurrency: 8, throughput: 140.3 infer/sec, latency 82236 usec
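To put a number on the difference between the two tables above, here is a small Python sketch that computes how much throughput grows between the lowest and highest concurrency. The figures are copied from the perf_client summaries in this issue; `scaling_factor` is just an illustrative helper name, not part of any Triton tool.

```python
# Throughput (infer/sec) by concurrency, copied from the perf_client summaries above.
tf_throughput = {1: 18.3, 2: 18.5, 3: 18.4, 4: 18.6, 5: 18.6, 6: 17.8, 7: 18.1, 8: 18.2}
onnx_throughput = {1: 64.5, 2: 82.3, 3: 86.3, 4: 86.0, 5: 101.8, 6: 115.9, 7: 128.5, 8: 140.3}


def scaling_factor(throughput_by_concurrency):
    """Return throughput at the highest concurrency divided by throughput at the lowest."""
    lowest = min(throughput_by_concurrency)
    highest = max(throughput_by_concurrency)
    return throughput_by_concurrency[highest] / throughput_by_concurrency[lowest]


if __name__ == "__main__":
    print(f"TF scaling 1 -> 8:   {scaling_factor(tf_throughput):.2f}x")
    print(f"ONNX scaling 1 -> 8: {scaling_factor(onnx_throughput):.2f}x")
```

A factor close to 1 means the extra in-flight requests are not adding throughput. On its own it does not say whether batching actually took place.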
What do I need to do to enable batching with TF models? Do I need to export a saved model with a new input shape (-1, -1, -1, 3) rather than (1, -1, -1, 3)?
Issue Analytics
- Created 3 years ago
- Comments:7 (3 by maintainers)
Top GitHub Comments
Triton is not batching for the ONNX model. As you note, it does not support batching. Perhaps you think it was batching because increasing the perf_analyzer concurrency resulted in increased throughput. That doesn’t necessarily require dynamic batching. Having 8 inference requests in flight at all times (concurrency 8) means that any network delays or other latencies can be hidden.
Why doesn’t the TF model scale with increased concurrency? Perhaps the bottleneck there is the model execution itself, so having more requests in flight does not actually help (although there is usually at least a small improvement going from concurrency 1 to 2). It doesn’t directly address your question, but make sure you read https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/optimization.html
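As a rough illustration of the point above, here is a toy Python model. It is an assumption made for illustration, not how Triton or perf_analyzer measure anything. It assumes each request spends `overhead_s` in work that can overlap with other requests (network, serialization) and `exec_s` in model execution that is strictly serialized on a single model instance. The parameter names and all numbers are hypothetical.

```python
def throughput_bound(concurrency, overhead_s, exec_s):
    """Upper bound on throughput (requests/sec) for a closed loop of `concurrency` clients.

    Two limits apply:
      * each client completes at most one request per (overhead_s + exec_s),
      * the single model instance completes at most one request per exec_s.
    """
    client_limit = concurrency / (overhead_s + exec_s)
    instance_limit = 1.0 / exec_s
    return min(client_limit, instance_limit)


if __name__ == "__main__":
    # Hypothetical model dominated by execution time: saturates almost immediately.
    for c in (1, 2, 8):
        print("exec-bound", c, round(throughput_bound(c, overhead_s=0.005, exec_s=0.050), 1))
    # Hypothetical model with large overlappable overhead: keeps scaling without any batching.
    for c in (1, 2, 8):
        print("overhead-bound", c, round(throughput_bound(c, overhead_s=0.040, exec_s=0.010), 1))
```

In this toy model the second case gains throughput with concurrency even though no requests are ever batched, while the first case is pinned near the execution limit from concurrency 2 onward.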
In both cases you have models that don’t support batching. See https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_configuration.html#inputs-and-outputs.
The model needs to have a variable-sized (-1) first (batch) dimension for all inputs and outputs for Triton to be able to dynamically batch, and the max_batch_size must be > 1. You need to file a ticket against the model zoo to find out why they are not producing models that can support batching.
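The rule above can be sketched as a quick check in Python. This is a simplified illustration, not Triton's actual autofill code. The helper name `supports_batching` is hypothetical, and the shapes are the ones reported for the TF2 model in this issue (where max_batch_size is 0, so the dims include the leading dimension).

```python
def supports_batching(tensor_shapes):
    """Return True only if every tensor has a variable-sized (-1) leading dimension."""
    return all(len(shape) > 0 and shape[0] == -1 for shape in tensor_shapes.values())


# Shapes as reported for the TF2 model above (a subset of the outputs, for brevity).
tf2_signature = {
    "input_tensor": [1, -1, -1, 3],
    "detection_scores": [1, 300],
    "detection_boxes": [1, 300, 4],
    "num_detections": [1],
}

# Hypothetical re-export in which the leading dimension of every tensor is variable.
batchable_signature = {
    "input_tensor": [-1, -1, -1, 3],
    "detection_scores": [-1, 300],
    "detection_boxes": [-1, 300, 4],
    "num_detections": [-1],
}

if __name__ == "__main__":
    print(supports_batching(tf2_signature))
    print(supports_batching(batchable_signature))
```

Running `saved_model_cli show --dir <model_dir> --all` on the exported SavedModel prints the signature shapes that a check like this would look at.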