Strange behavior in detection reference code

🐛 Describe the bug

Hi,

I am using your training procedure for object detection (https://github.com/pytorch/vision/blob/main/references/detection/train.py) with a custom dataset. When I evaluate my model, the output looks correct for the very first epoch, but for the following epochs the metrics fall to 0. This is not related to model performance: I can load a checkpoint and evaluate it in a separate process, which gives the expected values.

From the code, the difference between the COCO dataset and a custom dataset is handled here: https://github.com/pytorch/vision/blob/f4fd19335fca4dbb987603c08368be9496dd316d/references/detection/coco_utils.py#L203

I suppose the current behavior is not intended. Have you ever faced a similar issue, and how can I correct it?

Renaud

Versions

Collecting environment information…
PyTorch version: 1.10.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 8.2.2004 (Core) (x86_64)
GCC version: (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.28

Python version: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.18.0-193.6.3.el8_2.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
Nvidia driver version: 450.57
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.10.1
[pip3] torchvision==0.11.2
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 h2bc3f7f_2
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py39h7f8727e_0
[conda] mkl_fft 1.3.1 py39hd3c417c_0
[conda] mkl_random 1.2.2 py39h51133e4_0
[conda] numpy 1.21.2 py39h20f2e39_0
[conda] numpy-base 1.21.2 py39h79a1101_0
[conda] pytorch 1.10.1 py3.9_cuda11.3_cudnn8.2.0_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchvision 0.11.2 py39_cu113 pytorch

cc @datumbox

Issue Analytics

Created 2 years ago
Comments: 5 (5 by maintainers)

Top GitHub Comments

rvandeghen commented, Jan 26, 2022

@datumbox Indeed, the problem comes from the in-place operation bboxes[:, 2:] -= bboxes[:, :2], and cloning the tensor first solves it. I also checked that the masks are not modified in the current version, and it works fine.
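
A minimal sketch of the behavior described in this comment, assuming a custom dataset that keeps its target tensors in memory and returns the same tensor object on every access. The helper names (to_xywh_inplace, to_xywh_cloned) and the sample box are hypothetical and are not part of the torchvision reference code; only the slicing expression is taken from the comment. The sketch shows how the in-place subtraction alters the stored boxes on each call, which is one plausible reason the metrics look right on the first evaluation and wrong afterwards.

```python
import torch


def to_xywh_inplace(boxes: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: convert [x1, y1, x2, y2] to [x, y, w, h] in place.

    The slice on the left-hand side is a view, so the caller's tensor
    is modified as well.
    """
    boxes[:, 2:] -= boxes[:, :2]
    return boxes


def to_xywh_cloned(boxes: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: same conversion, applied to a private copy."""
    boxes = boxes.clone()  # the caller's tensor stays untouched
    boxes[:, 2:] -= boxes[:, :2]
    return boxes


# Assumed ground-truth box held in memory by a custom dataset.
stored = torch.tensor([[10.0, 20.0, 50.0, 80.0]])

# With the clone, repeated conversions agree and the stored box is unchanged.
first = to_xywh_cloned(stored)
second = to_xywh_cloned(stored)
print(torch.equal(first, second))  # True
print(stored.tolist())             # [[10.0, 20.0, 50.0, 80.0]]

# Without the clone, every call shrinks the stored box a little more.
shared = stored.clone()
eval_1 = to_xywh_inplace(shared).clone()
eval_2 = to_xywh_inplace(shared).clone()
print(eval_1.tolist())  # [[10.0, 20.0, 40.0, 60.0]]
print(eval_2.tolist())  # [[10.0, 20.0, 30.0, 40.0]]
```

Cloning before the conversion keeps each evaluation independent of earlier ones, at the cost of one extra copy of the box tensor per image.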

datumbox commented, Jan 26, 2022

@rvandeghen Thanks for raising this.

It is very difficult to tell what the problem is, given that I can’t reproduce the issue without your custom dataset. What’s unclear to me is why you need to deep-copy the dataset given that you don’t modify it.

"Do you think it is useful to make a PR with this change to solve this issue?"

The reference script works fine with COCO, which is its intended use case. Moreover, the script serves as a starting point for how one can build their own loops. With the info I have at this point, I don’t think applying the deep-copy patch in the general case is worth it.