Missing `fixed_size` value in GeneralizedRCNNTransform breaks Faster-RCNN torchscript loading with C++ in train mode

🐛 Describe the bug

The fixed_size value in the GeneralizedRCNNTransform instantiation for faster_rcnn defaults to None, which breaks TorchScript inference in C++.

See https://github.com/pytorch/vision/blob/7947fc8fb38b1d3a2aca03f22a2e6a3caa63f2a0/torchvision/models/detection/faster_rcnn.py#L234 and compare to https://github.com/pytorch/vision/blob/7947fc8fb38b1d3a2aca03f22a2e6a3caa63f2a0/torchvision/models/detection/ssd.py#L203, where fixed_size is explicitly set.

Thus with faster_rcnn, fixed_size defaults to None and loading from C++ yields:

Dynamic exception type: torch::jit::ErrorReport
std::exception::what: 
Unknown type name 'NoneType':
Serialized   File "code/__torch__/torchvision/models/detection/transform.py", line 11
  image_std : List[float]
  size_divisible : int
  fixed_size : NoneType
               ~~~~~~~~ <--- HERE
  def forward(self: __torch__.torchvision.models.detection.transform.GeneralizedRCNNTransform,
    images: List[Tensor],

To reproduce, we export fasterrcnn_resnet50_fpn with torch.jit.script and load the result from C++ with torch::jit::load().

The exact Python export script we use is https://github.com/jolibrain/deepdetect/blob/master/tools/torch/trace_torchvision.py, which we run as:

python3 trace_torchvision.py fasterrcnn_resnet50_fpn --num_classes 2

Versions

PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.21.1
Libc version: glibc-2.25

Python version: 3.6.9 (default, Jan 26 2021, 15:33:00)  [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-4.15.0-151-generic-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: NVIDIA RTX A5000
GPU 1: NVIDIA TITAN X (Pascal)
GPU 2: NVIDIA GeForce GTX 1080 Ti

Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.2
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.18.1
[pip3] torch==1.9.0+cu111
[pip3] torchaudio==0.9.0
[pip3] torchfile==0.1.0
[pip3] torchvision==0.10.0+cu111
[pip3] torchviz==0.0.1
[conda] Could not collect

cc @datumbox

Issue Analytics

Created 2 years ago
Comments: 10 (5 by maintainers)

Top GitHub Comments

2 reactions

beniz commented, Sep 6, 2021

@datumbox Thanks very much. Further tests on our side revealed that the C++ build was using torch 1.8, while with 1.9 there is no error. My deepest apologies for the time this required on your side. Maybe PR #4369 remains useful, if only on the principle of having properly typed signatures. I’m closing this issue. Thanks again for this and for the excellent work by the torchvision team!
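
The mismatch reported in the comment above (a model exported with torch 1.9 and loaded by a C++ build using torch 1.8) is the kind of problem that can be flagged before loading, for example by recording the exporter's version next to the model file and comparing it with the loader's version. The sketch below uses only the standard library. `major_minor` and `loader_is_older` are hypothetical helper names introduced for illustration, not part of torch, and the version strings are the ones quoted in this thread.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '1.9.0+cu111'."""
    release = version.split("+", 1)[0]  # drop a local build tag like '+cu111'
    parts = release.split(".")
    return int(parts[0]), int(parts[1])


def loader_is_older(exporter_version: str, loader_version: str) -> bool:
    """True when the loading runtime predates the exporting one (major.minor)."""
    return major_minor(loader_version) < major_minor(exporter_version)


exporter = "1.9.0+cu111"  # torch version from the issue's environment report
loader = "1.8"            # torch version of the C++ build, per the comment

if loader_is_older(exporter, loader):
    print("Warning: model was exported with a newer torch than the loader uses")
```

Comparing integer tuples rather than raw strings keeps the check correct for versions such as 1.10, which would sort before 1.9 as text.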

1 reaction

datumbox commented, Sep 6, 2021

@beniz I’ve temporarily modified a similar test that we have in vision (here) to export the model in train mode. I then passed data through it and did not get any errors (see here).

Without being able to reproduce the error you see, it’s hard to provide guidance. Would you be able to send a dummy PR that modifies the above scripts so that they resemble your setup and reproduce the error on our CI (see the linked commit above for an example)? If you manage to reproduce it with a minimal example, I can help you investigate further.