Dynamic batching does not give better latency for Roberta running on TensorRT.
Hi, I used your build_engine API to convert the Roberta model. If I build with a constant batch size for input_shapes, i.e. (min, optimal, max) -> (1, 1, 1) or (4, 4, 4), the model yields good results (faster than ort and torch). But when I convert it with a dynamic batch size, i.e. (min, optimal, max) -> (1, 4, 4), the model performs really slowly compared to ort or torch.
Code to understand the problem better:
# fast inference but constrained to use always 4 batches during inferencing
tensor_shapes = list(zip([4, 4, 4], [1, 128, 128]))
# slow inference
tensor_shapes = list(zip([1, 4, 4], [1, 128, 128]))
engine: ICudaEngine = build_engine(
    runtime=runtime,
    onnx_file_path=onnx_model_path,
    logger=trt_logger,
    min_shape=tensor_shapes[0],
    optimal_shape=tensor_shapes[1],
    max_shape=tensor_shapes[2],
    workspace_size=workspace_size * 1024**3,
    fp16=not quantization,
    int8=quantization,
    profiling=True,
)
save_engine(engine=engine, engine_file_path=tensorrt_path)
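For context, here is a minimal sketch of what such (min, optimal, max) shapes usually correspond to in the raw TensorRT Python API, i.e. an optimization profile attached to the builder config. This illustrates the general mechanism only and is not the actual internals of build_engine; the shapes mirror the slow configuration above, where both the batch and sequence axes end up dynamic.

import tensorrt as trt

# Illustrative sketch only (assumed mapping, not build_engine's real code).
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()
profile = builder.create_optimization_profile()

# min=(1, 1), opt=(4, 128), max=(4, 128): both the batch axis (1 -> 4) and the
# sequence axis (1 -> 128) vary, which is what triggers the "Myelin graph with
# multiple dynamic values" warning visible in the logs below.
for name in ("input_ids", "attention_mask"):
    profile.set_shape(name, min=(1, 1), opt=(4, 128), max=(4, 128))
config.add_optimization_profile(profile)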
The complete build and inference logs for the slow case (engine built with a dynamic batch size):
[06/02/2022-03:19:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +312, GPU +0, now: CPU 3789, GPU 2470 (MiB)
[06/02/2022-03:19:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 3790, GPU 2470 (MiB)
[06/02/2022-03:19:09] [TRT] [I] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 3790 MiB, GPU 2470 MiB
[06/02/2022-03:19:09] [TRT] [I] [MemUsageSnapshot] End constructing builder kernel library: CPU 3924 MiB, GPU 2504 MiB
[06/02/2022-03:19:09] [TRT] [I] parsing TensorRT model
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1418322027
[06/02/2022-03:19:22] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
[06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
[06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
[06/02/2022-03:19:43] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +512, GPU +226, now: CPU 5802, GPU 2730 (MiB)
[06/02/2022-03:19:43] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +116, GPU +52, now: CPU 5918, GPU 2782 (MiB)
[06/02/2022-03:19:43] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[06/02/2022-03:19:43] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[06/02/2022-03:19:43] [TRT] [W] (# 1 (SHAPE input_ids))
[06/02/2022-03:19:43] [TRT] [W] (# 0 (SHAPE attention_mask))
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
[06/02/2022-03:25:32] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[06/02/2022-03:25:32] [TRT] [W] (# 1 (SHAPE input_ids))
[06/02/2022-03:25:32] [TRT] [W] (# 0 (SHAPE attention_mask))
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
[06/02/2022-03:30:10] [TRT] [I] Detected 2 inputs and 1 output network tensors.
[06/02/2022-03:30:10] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[06/02/2022-03:30:10] [TRT] [W] (# 1 (SHAPE input_ids))
[06/02/2022-03:30:10] [TRT] [W] (# 0 (SHAPE attention_mask))
[06/02/2022-03:30:32] [TRT] [I] Total Host Persistent Memory: 208
[06/02/2022-03:30:32] [TRT] [I] Total Device Persistent Memory: 0
[06/02/2022-03:30:32] [TRT] [I] Total Scratch Memory: 442827264
[06/02/2022-03:30:32] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 774 MiB, GPU 2058 MiB
[06/02/2022-03:30:32] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.038945ms to assign 4 blocks to 4 nodes requiring 443041280 bytes.
[06/02/2022-03:30:32] [TRT] [I] Total Activation Memory: 443041280
[06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5993, GPU 4298 (MiB)
[06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 5993, GPU 4306 (MiB)
[06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +1353, now: CPU 0, GPU 1353 (MiB)
[06/02/2022-03:30:33] [TRT] [I] Loaded engine size: 1364 MiB
[06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7354, GPU 4282 (MiB)
[06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 7355, GPU 4290 (MiB)
[06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1352, now: CPU 0, GPU 1352 (MiB)
[06/02/2022-03:30:38] [TRT] [I] Loaded engine size: 1364 MiB
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7366, GPU 5636 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 7367, GPU 5644 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1352, now: CPU 0, GPU 2704 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 6002, GPU 5636 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6002, GPU 5644 (MiB)
[06/02/2022-03:30:43] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +423, now: CPU 0, GPU 3127 (MiB)
latencies in ms
--------------------------------------------------
Pytorch
--------------------------------------------------
[93.5968, 94.0308, 94.8224, 93.6746, 94.5972, 94.0188, 92.3105, 93.6535, 92.4908, 91.4413]
--------------------------------------------------
Onnxruntime
--------------------------------------------------
[81.445, 81.3684, 80.2145, 81.5339, 82.9578, 83.6845, 83.6738, 82.6652, 81.5462, 82.8237]
--------------------------------------------------
TensorRT (FP16)
--------------------------------------------------
[426.353, 425.1992, 426.0317, 425.8226, 426.8828, 428.0485, 426.3119, 426.4556, 425.4863, 426.0393]
--------------------------------------------------
Is this the expected behavior?
I want to convert the model to use dynamic batch sizes: at inference time it should handle a variable batch size and still be fast. How can I achieve that?
Any help would be greatly appreciated, thank you in advance.
Top GitHub Comments
It depends on your use case and definitely requires a real benchmark. In my own use case, I had something like 10 batches of documents, 9 of the same size and 1 a bit smaller. Adding fake docs to the last one cost me very little compared to the boost offered by TRT. On the other hand, my docs range from 100 to 400 tokens, so padding every document to a fixed sequence length would have cost me a lot.
If you are working with very short documents (1-30 tokens), I would say that fixing the sequence length is a viable option too, but again, it requires real measurements, depends on your GPU, etc.
I have been able to reproduce your results. The warning regarding dynamic values means that for best performance you should use only one dynamic axis (batch size or sequence length). https://forums.developer.nvidia.com/t/myelin-graph-error-when-converting-to-trt-engine-inference/202019/6
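To make that concrete, here are the two single-dynamic-axis variants expressed with the same tensor_shapes convention as in the snippet at the top of the issue. The first variant is not benchmarked in this thread, so treat it as an untested assumption:

# Option A: dynamic batch size (1 to 4), sequence length fixed at 128
tensor_shapes = list(zip([1, 4, 4], [128, 128, 128]))

# Option B: batch size fixed at 4, dynamic sequence length (the "fast" case above)
tensor_shapes = list(zip([4, 4, 4], [1, 128, 128]))

# In both cases only one axis differs between min and max, so the
# "multiple dynamic values" warning should no longer apply.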
I tried keeping the batch size fixed and the results are much better 😃
I don't know your use case, but we had something similar; in our own case, we added fake data to keep the batch axis fixed… It only works if most of your batches are full.
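A minimal sketch of that padding trick, assuming NumPy inputs already tokenized to a common sequence length and an engine built with the batch axis fixed at 4; the helper below is hypothetical and not part of the repository:

import numpy as np

def pad_batch(input_ids: np.ndarray, attention_mask: np.ndarray, fixed_batch: int = 4):
    """Pad a partial batch with fake rows so the batch axis stays at fixed_batch."""
    real = input_ids.shape[0]
    missing = fixed_batch - real
    if missing > 0:
        # duplicate the last real row as filler; its predictions are discarded later
        input_ids = np.concatenate([input_ids, np.repeat(input_ids[-1:], missing, axis=0)])
        attention_mask = np.concatenate([attention_mask, np.repeat(attention_mask[-1:], missing, axis=0)])
    return input_ids, attention_mask, real

# After inference, keep only the first `real` rows of the engine output.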
I have already noticed this behavior, but never to this extent.
For reference, when the batch size and seq len are both dynamic: