custom training asserts with "degenerate bboxes" over and over - but bboxes look correct, any debugging insight?
I’m trying to get my custom dataset working, but I can’t get past 8 or so images via get_item before it asserts that my bboxes are bad. I pull the flagged image, it flags the next one, I pull that one, it flags the next…
From reading the code, it wants to check that x1 and y1 are at least as large as x0 and y0, which is a great check:
55 assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
But it keeps flagging images whose boxes, when I unwind them from COCO format, should be fine. Any insights? I was not able to print the boxes1 (200, 4) and boxes2 (12, 4) tensors for some reason, so I couldn’t see what it was actually computing (printing threw an odd GPU ‘formatting’ error).
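(A minimal sketch of the kind of debug print I was attempting inside generalized_box_iou, assuming boxes1/boxes2 are CUDA tensors; moving them to CPU first usually sidesteps the formatting error. The helper name is mine:)

```python
def dump_offending_boxes(boxes1, boxes2):
    # detach from the graph and move to CPU before printing;
    # this avoids most device-side "formatting" errors
    print("boxes1:", boxes1.detach().cpu())
    print("boxes2:", boxes2.detach().cpu())
    # show only the rows that would trip the assert
    bad = ~(boxes1[:, 2:] >= boxes1[:, :2]).all(dim=1)
    if bad.any():
        print("degenerate rows in boxes1:", boxes1[bad].detach().cpu())
```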
For example, it flagged this image as bad. Here’s the JSON for it in COCO format, 6 classes. (One box will surround the other 5 objects, btw, as it’s a malaria reader, so I’m not sure if a box encompassing other boxes is really the issue?)
{"id": "c33c3539-8bd1-48e0-8065-831709e5e64d", "image_id": 3091210, "category_id": 2905442, "segmentation": null, "area": 0, "bbox": **[499, 121, 177, 80]**, "iscrowd": 0},
{"id": "0023d71e-e1e9-4862-a0b8-6e2bc3982b3b", "image_id": 3091210, "category_id": 2905422, "segmentation": null, "area": 0, "bbox": **[492, 523, 187, 163]**, "iscrowd": 0},
{"id": "726fdfbc-3801-409d-ab75-ccf951e74316", "image_id": 3091210, "category_id": 2905421, "segmentation": null, "area": 0, "bbox": **[496, 428, 181, 93],** "iscrowd": 0},
{"id": "2bf85a8e-108d-4875-b0f5-47c8e5cb13e0", "image_id": 3091210, "category_id": 2905420, "segmentation": null, "area": 0, "bbox": **[494, 272, 186, 169]**, "iscrowd": 0},
{"id": "8669c13a-1205-4e94-a645-18e2ffa491d0", "image_id": 3091210, "category_id": 2905419, "segmentation": null, "area": 0, "bbox": **[489, 127, 193, 557]**, "iscrowd": 0},
{"id": "d9619859-e0ef-4632-ad51-7237a5760a5e", "image_id": 3091210, "category_id": 2905418, "segmentation": null, "area": 0, "bbox": **[495, 203, 182, 73]**, "iscrowd": 0},
And as a check for myself, here’s the COCO format: the COCO bounding box format is [top-left x position, top-left y position, width, height].
All the bboxes it flags have positive width and height, so x1 and y1 must be larger than x0 and y0; only a negative width or height added to the original x0 or y0 could make them smaller. So I’m unclear what it is asserting on, or why.
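(A minimal sketch of that unwind on the first annotation above; the helper name is mine, not from the repo:)

```python
def xywh_to_xyxy(bbox):
    """Convert a COCO [x, y, width, height] box to [x0, y0, x1, y1]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

# first annotation above: width and height are positive, so the assert should pass
print(xywh_to_xyxy([499, 121, 177, 80]))  # -> [499, 121, 676, 201]
# x1 = 676 >= x0 = 499 and y1 = 201 >= y0 = 121, i.e. not degenerate
```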
But it asserts here:
~/detr/util/box_ops.py in generalized_box_iou(boxes1, boxes2)
53 #print(boxes1)
54 #print(boxes2)
---> 55 assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
56 assert (boxes2[:, 2:] >= boxes2[:, :2]).all()
57 iou, union = box_iou(boxes1, boxes2)
I’ve removed 15+ images trying to get it to actually train, but it just keeps flagging more and more as having invalid bboxes. I remove one image, then it asserts on the next one, and in reviewing the ones it flags vs. the ones it lets pass, I don’t see any real difference. (I have trained with this same dataset on EfficientDet, so I know the dataset is reasonable.)
So any insight into debugging this, or into what might be awry, would be appreciated. Thanks!
Top GitHub Comments
Hi @lessw2020, apologies for the confusion: the class IDs need to be remapped to [0, 6]. Basically you want
tgt_ids.max() < num_classes
EDIT: to clarify, in our case for COCO there are 80 classes with labels in [0, 90], and for simplicity we don’t do any remapping, so we use num_classes=91 (so that we satisfy the inequality above). It doesn’t matter that some ids will never be used (it’s a slight waste of parameters, but negligible in this case). In your case that won’t work, though: you really don’t want a softmax over 2.9M elements, so remapping is the way to go.
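(A minimal sketch of that remapping, assuming anns is the list of annotation dicts shown above; the variable names are mine, not part of detr:)

```python
# build a contiguous label mapping from the raw COCO category ids
raw_ids = sorted({a["category_id"] for a in anns})  # e.g. [2905418, ..., 2905442]
id_map = {raw: i for i, raw in enumerate(raw_ids)}  # 2905418 -> 0, ..., 2905442 -> 5

for a in anns:
    a["category_id"] = id_map[a["category_id"]]

# now the max label is 5, so num_classes=6 satisfies the inequality above
```

The same mapping should also be applied to the "categories" section of the annotation file so the ids stay consistent.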
@lessw2020, from debug2.txt, the error indicates that your class probability tensor has fewer elements than the ground-truth indices.
If you add a print(tgt_ids.max()) in your code, you’ll see that it is larger than 6, which means there might be an issue with your dataset (you have more classes than you thought). I believe this is probably the issue you are facing.
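(A minimal sketch of that kind of pre-flight check, run once over the targets before training; the function name is mine, not from detr, though the targets-as-dicts-with-"labels" layout matches the repo:)

```python
def check_labels(targets, num_classes):
    """Fail fast if any ground-truth label would break tgt_ids.max() < num_classes."""
    max_id = max(int(t["labels"].max()) for t in targets)
    assert max_id < num_classes, (
        f"max label {max_id} >= num_classes {num_classes}; remap your category ids"
    )
```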
As an unrelated note, I saw that you are passing --no_aux_loss to the model. Note that our best results are obtained with aux_loss: the evaluation code doesn’t need aux loss because it’s just evaluation and it is slightly faster without it, but for training in general it’s better to use aux_loss.