Score/confidence of prediction drop a lot after convert to trt engine
See original GitHub issueI am using Dyhead to train an image detection model: https://github.com/open-mmlab/mmdetection/blob/master/configs/dyhead/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x_coco.py
Using GPU docker, convert to tensorrt with tools/deploy.py success:
python3 tools/deploy.py /workdir/detection_onnx_static_1024x1024.py /workdir/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x.py /workdir/latest.pth /workdir/test-deploy-img-1024.jpg --device cuda --dump-info
Although the conversion have a lot warning like below:
/root/workspace/mmdeploy/mmdeploy/core/optimizers/function_marker.py:158: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
ys_shape = tuple(int(s) for s in ys.shape)
/root/workspace/mmdeploy/mmdeploy/codebase/mmdet/models/detectors/base.py:24: TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
img_shape = [int(val) for val in img_shape]
/root/workspace/mmdeploy/mmdeploy/codebase/mmdet/models/backbones.py:202: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
slice_w = (W + self.window_size - 1) // self.window_size * self.window_size
WARNING: The shape inference of mmdeploy::MMCVModulatedDeformConv2d type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
WARNING: The shape inference of mmdeploy::TRTInstanceNormalization type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
[09/01/2022-01:33:56] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +809, GPU +348, now: CPU 3460, GPU 785 (MiB)
[09/01/2022-01:33:56] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +126, GPU +60, now: CPU 3586, GPU 845 (MiB)
[09/01/2022-01:33:56] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.3.2
[09/01/2022-01:33:56] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[09/01/2022-01:35:06] [TRT] [I] Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes.
[09/01/2022-01:36:46] [TRT] [I] Detected 1 inputs and 2 output network tensors.
[09/01/2022-01:36:48] [TRT] [I] Total Host Persistent Memory: 500480
[09/01/2022-01:36:48] [TRT] [I] Total Device Persistent Memory: 393728
[09/01/2022-01:36:48] [TRT] [I] Total Scratch Memory: 1918222336
[09/01/2022-01:36:48] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 0 MiB
[09/01/2022-01:36:53] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 5282.29ms to assign 35 blocks to 1075 nodes requiring 2039239680 bytes.
[09/01/2022-01:36:53] [TRT] [I] Total Activation Memory: 2039239680
[09/01/2022-01:36:53] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5632, GPU 2311 (MiB)
[09/01/2022-01:36:53] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 5632, GPU 2319 (MiB)
[09/01/2022-01:36:53] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.3.2
[09/01/2022-01:36:53] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[09/01/2022-01:36:53] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[09/01/2022-01:36:53] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
2022-09-01 01:36:54,590 - mmdeploy - INFO - Finish pipeline mmdeploy.backend.tensorrt.onnx2tensorrt.onnx2tensorrt
2022-09-01 01:36:55,388 - mmdeploy - WARNING - "visualize_model" has been skipped may be because it's running on a headless device.
2022-09-01 01:36:55,388 - mmdeploy - INFO - All process success.
I know after conversion the result is not exactly the same so I am okey with bounding box value difference(although its off quite a bit too), but the score is kind of drop too much!
Below is the trt engine result (x1,y1,x2,y2,score)
[8.8506012, 358.41714, 149.80162, 495.56137, 0.081301391]
And below is the original predict with mmdet, the bbox have been round down
[0, 328, 165, 526, 0.53286]
Here is my env with python3 tools/check_env.py
2022-09-01 02:08:19,488 - mmdeploy - INFO -
2022-09-01 02:08:19,489 - mmdeploy - INFO - **********Environmental information**********
2022-09-01 02:08:19,681 - mmdeploy - INFO - sys.platform: linux
2022-09-01 02:08:19,681 - mmdeploy - INFO - Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
2022-09-01 02:08:19,681 - mmdeploy - INFO - CUDA available: True
2022-09-01 02:08:19,681 - mmdeploy - INFO - GPU 0: NVIDIA RTX A4000
2022-09-01 02:08:19,681 - mmdeploy - INFO - CUDA_HOME: /usr/local/cuda
2022-09-01 02:08:19,681 - mmdeploy - INFO - NVCC: Cuda compilation tools, release 11.6, V11.6.124
2022-09-01 02:08:19,681 - mmdeploy - INFO - GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
2022-09-01 02:08:19,681 - mmdeploy - INFO - PyTorch: 1.12.0
2022-09-01 02:08:19,681 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2022.0-Product Build 20211112 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.6
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.3.2 (built against CUDA 11.5)
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
2022-09-01 02:08:19,681 - mmdeploy - INFO - TorchVision: 0.13.0
2022-09-01 02:08:19,681 - mmdeploy - INFO - OpenCV: 4.6.0
2022-09-01 02:08:19,681 - mmdeploy - INFO - MMCV: 1.6.1
2022-09-01 02:08:19,681 - mmdeploy - INFO - MMCV Compiler: GCC 9.3
2022-09-01 02:08:19,681 - mmdeploy - INFO - MMCV CUDA Compiler: 11.6
2022-09-01 02:08:19,681 - mmdeploy - INFO - MMDeploy: 0.7.0+21775ce
2022-09-01 02:08:19,681 - mmdeploy - INFO -
2022-09-01 02:08:19,681 - mmdeploy - INFO - **********Backend information**********
2022-09-01 02:08:20,066 - mmdeploy - INFO - onnxruntime: 1.8.1 ops_is_avaliable : True
2022-09-01 02:08:20,091 - mmdeploy - INFO - tensorrt: 8.4.3.1 ops_is_avaliable : True
2022-09-01 02:08:20,105 - mmdeploy - INFO - ncnn: None ops_is_avaliable : False
2022-09-01 02:08:20,106 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-09-01 02:08:20,107 - mmdeploy - INFO - openvino_is_avaliable: False
2022-09-01 02:08:20,121 - mmdeploy - INFO - snpe_is_available: False
2022-09-01 02:08:20,121 - mmdeploy - INFO -
2022-09-01 02:08:20,121 - mmdeploy - INFO - **********Codebase information**********
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmdet: 2.25.1
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmseg: None
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmcls: None
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmocr: None
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmedit: None
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmdet3d: None
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmpose: None
2022-09-01 02:08:20,122 - mmdeploy - INFO - mmrotate: None
And here is my deploy config (/workdir/detection_onnx_static_1024x1024.py):
codebase_config = dict(
type='mmdet',
task='ObjectDetection',
model_type='end2end',
post_processing=dict(
score_threshold=0.05,
confidence_threshold=0.005, # for YOLOv3
iou_threshold=0.5,
max_output_boxes_per_class=200,
pre_top_k=5000,
keep_top_k=100,
background_label_id=-1,
)
)
onnx_config = dict(
type='onnx',
export_params=True,
keep_initializers_as_inputs=False,
opset_version=11,
save_file='dyhead_swin_1024.onnx',
input_names=['input'],
output_names=['dets', 'labels'],
input_shape=[1024, 1024],
optimize=True
)
backend_config = dict(
type='tensorrt',
common_config=dict(fp16_mode=False, max_workspace_size=8 << 30),
model_inputs=[
dict(
input_shapes=dict(
input=dict(min_shape=[1, 3, 1024, 1024],
opt_shape=[1, 3, 1024, 1024],
max_shape=[1, 3, 1024, 1024]
)
)
)
]
)
I also try with the default tensorrt version 8.2.x but no success
Could someone please help? Thanks a lot!!
Issue Analytics
- State:
- Created a year ago
- Comments:37
I’ve done that. Just needed to rescale detected boxes to original image shape. Now I can prove: transformed to TRT dyhead model works fine, boxes and scores are same as obtained from original model. There are some small discrepancies for some detections, I believe they are due to small numerical discrepancies appeared during model transformation and this is normal. Thanks to @hanrui1sensetime @grimoire
BTW, currently I’ve tested only FP32 mode. Gonna do same test for FP16.
@tak-ho-raspect, @hanrui1sensetime , I’ve found the discrepancy between
rewrite_outputs
(in test code upper) disappears if here one replaces index from 1 (taking offsets shape as output shape) to 0 (taking input shape as output shape):ret.d[2] = inputs[0].d[2];
ret.d[3] = inputs[0].d[3];
But after such fix onnx2tensorrt convertion crashes with error about wrong dimensions in graph node
“”" [12/08/2022-01:44:30] [TRT] [I] MatMul_1915: broadcasting input1 to make tensors conform, dims(input0)=[20,144,1536][NONE] dims(input1)=[1,1536,1536][NONE]. [12/08/2022-01:44:30] [TRT] [E] [graphShapeAnalyzer.cpp::analyzeShapes::1285] Error Code 4: Miscellaneous (IElementWiseLayer Add_2075: broadcast dimensions must be conformable) Traceback (most recent call last): File “tools/onnx2tensorrt.py”, line 73, in <module> main() File “tools/onnx2tensorrt.py”, line 58, in main from_onnx( File “/root/workspace/mmdeploy/mmdeploy/backend/tensorrt/utils.py”, line 165, in from_onnx raise RuntimeError(f’Failed to parse onnx, {error_msgs}') RuntimeError: Failed to parse onnx, In node 2075 (parseGraph): INVALID_NODE: Invalid Node - Add_2075 [graphShapeAnalyzer.cpp::analyzeShapes::1285] Error Code 4: Miscellaneous (IElementWiseLayer Add_2075: broadcast dimensions must be conformable) “”"