Is there a way to use images with no bounding boxes to train a FasterRCNN?
Hi, I am currently using version 1.7.1 of PyTorch with the corresponding version of torchvision. Everything works perfectly when training a FasterRCNN FPN 50 using your reference training project. That is, until I added images with no bounding boxes to the training data, at which point the whole thing broke down.
I need background images with no bounding boxes because the model needs to learn that certain types of images will not have objects in them, and if I only feed it images with objects in them, it will never see these types of images and predict nonsense when it gets them during inference.
Is this feature supported? The error I currently get when trying to train is:
/home/joaqo/.local/lib/python3.6/site-packages/torchvision/models/detection/generalized_rcnn.py, line 69, in forward
    boxes.shape))
ValueError: Expected target boxes to be a tensor of shape [N, 4], got torch.Size([0]).
This makes perfect sense if you want to force your model to train only on images with targets; regrettably, that's not my use case. If this is not currently supported, would anyone provide some pointers so I can add support for it?
Thanks.
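For reference, the check that raises this error (in generalized_rcnn.py in the torchvision versions of that era) only validates the shape of each target's boxes tensor: it must be 2-D with a last dimension of 4, even when it contains zero boxes. A small sketch of the shapes involved, with purely illustrative values:

```python
import torch

# Shapes the [N, 4] check accepts vs. the one it rejects (illustrative values only).
boxes_one_object = torch.tensor([[10.0, 20.0, 50.0, 80.0]])  # shape [1, 4] -> accepted
boxes_empty_2d   = torch.zeros((0, 4), dtype=torch.float32)  # shape [0, 4] -> accepted
boxes_empty_1d   = torch.tensor([])                          # shape [0]    -> triggers the ValueError above

for b in (boxes_one_object, boxes_empty_2d, boxes_empty_1d):
    ok = b.ndim == 2 and b.shape[-1] == 4
    print(tuple(b.shape), "accepted" if ok else "rejected")
```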
Top GitHub Comments
Hi @joaqo !
Yes, it is possible to use images with no bounding boxes in the training data with torchvision. This feature was implemented in release 0.6. Here is a note on how to use it.
If you have a background image with no bounding boxes at all, the boxes are assumed to be [0, 0, 0, 0], hence the area is 0. The label is also assumed to be the background label, which is 0 (predefined by the torchvision models). For example, if such an image has image_id = 3, you need to pass a target for it that follows these conventions (see the sketch below). I think the reference script will work out of the box if you make these changes to your dataset.
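The exact snippet from the original comment is not reproduced above; the following is a minimal sketch of what a dataset's __getitem__ could return for such a background image. Note that the torchvision releases discussed here (0.6 and later, including the 0.8.x used in the question) also accept an empty boxes tensor of shape [0, 4] for negative samples, and that is the form used below; the field names follow the standard torchvision detection target format, and the image_id of 3 is just the example value from above.

```python
import torch

def make_negative_target(image_id):
    # Target for an image with no objects: empty boxes/labels tensors.
    # (Assumed convention for torchvision >= 0.6; field names follow the
    # detection target format used by the torchvision reference scripts.)
    return {
        "boxes": torch.zeros((0, 4), dtype=torch.float32),  # shape [0, 4], passes the [N, 4] check
        "labels": torch.zeros((0,), dtype=torch.int64),     # no labels for a background image
        "image_id": torch.tensor([image_id]),                # e.g. 3, as in the example above
        "area": torch.zeros((0,), dtype=torch.float32),      # area of each box; none here
        "iscrowd": torch.zeros((0,), dtype=torch.int64),     # used by the COCO-style evaluation
    }

# In a dataset's __getitem__, a background image would return (image, make_negative_target(idx)).
```

With a target like this, the shape check quoted in the question passes, since [0, 4] is still a valid [N, 4] tensor.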
Feel free to post problems you face in this issue.
Internally, FasterRCNN (and many other object detection models) divides the image into many, sometimes overlapping, areas called anchors. For each of these anchors it predicts whether it belongs to one of the foreground classes or to the background class. During training, the model samples a bunch of these anchors in different strategic ways, tries to keep a reasonable proportion between foreground and background classes (look here and here for the proportions used in FasterRCNN), and computes the loss function on them.
So the model is learning not only how well it did on the anchors it predicted as foreground, but also on the ones it predicted as background. Seen this way, training on an image with no annotations is not really a problem: it is as if the foreground/background proportion were set to all background for that one sample. You cannot do it for every sample, but for a subset of them it's fine.
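As a concrete pointer, the foreground/background sampling described above is controlled by constructor arguments of torchvision's FasterRCNN. Here is a sketch of where those proportions can be adjusted, using the torchvision 0.8-era API from the question; the values shown are, as far as I know, the library defaults:

```python
import torchvision

# The RPN and the box head each sample a fixed number of anchors/proposals per image
# and aim for a given positive (foreground) fraction; for a negative image, all of the
# sampled anchors simply end up labelled as background for that sample.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    pretrained=False,
    num_classes=2,                  # 1 foreground class + background (illustrative)
    rpn_batch_size_per_image=256,   # anchors sampled per image for the RPN loss
    rpn_positive_fraction=0.5,      # target fraction of foreground anchors in that sample
    box_batch_size_per_image=512,   # proposals sampled per image for the box head loss
    box_positive_fraction=0.25,     # target fraction of foreground proposals
)
```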