Training error. bg_num_rois = 0 and fg_num_rois = 0, this should not happen!

Hi, I ran into a problem during training. The error message is as follows:

ValueError: bg_num_rois = 0 and fg_num_rois = 0, this should not happen!

Before the error appears, the loss has already become nan. I tried suggestions such as clipping gradients and reducing the learning rate, but none of them worked.

[session 1][epoch  1][iter  300/2164] loss: nan, lr: 1.00e-04
			fg/bg=(128/0), time cost: 29.000118

I checked my annotation files and some xmin values are 0. I don't know whether that is the problem, because adding 1 to xmin did not help. I also printed gt_boxes and found an xmin of roughly 64041, which is clearly wrong.

gt_boxes is tensor([[[6.4041e+04, 1.7687e+02, 2.2182e+02, 4.3876e+02, 2.0000e+00],
         [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
         [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
         [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
         [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]]],

So I think something is wrong in how your code computes gt_boxes, but it is hard to track down. Could you give me a clue about how to fix it? Thanks for your kind reply!
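One plausible explanation for a value near 6.4e+04, offered as an assumption rather than a confirmed diagnosis: if the loader subtracts 1 from each coordinate and then stores the result in an unsigned 16-bit array, an xmin of 0 becomes -1 and wraps around to 65535, which image rescaling could then bring down to the printed value. The sketch below emulates that wraparound with plain integer arithmetic; `to_uint16` is a hypothetical helper, not a function from the repository.

```python
def to_uint16(value: int) -> int:
    """Emulate storing an integer in an unsigned 16-bit slot (wraps modulo 2**16)."""
    return value % (1 << 16)


xmin_from_xml = 0                  # annotation that touches the left image edge
x1_zero_based = xmin_from_xml - 1  # the "- 1" offset assumed to be in the loader
stored = to_uint16(x1_zero_based)  # what an unsigned 16-bit array would hold

print(x1_zero_based, stored)
```

Coordinates that start at 1 or higher pass through unchanged, which would explain why only some images trigger the failure.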

Issue Analytics

Created 4 years ago
Reactions: 2
Comments: 5

Top GitHub Comments

8 reactions

marcunzueta commented, Aug 4, 2019

Hi, I found the same bug while trying to create my own data with the images from OpenImage for the Kaggle competition.

Check these lines of pascal_voc.py: https://github.com/jwyang/faster-rcnn.pytorch/blob/358cecacf876717ff13988dc6396de10e265279c/lib/datasets/pascal_voc.py#L234-L237 The same lines will be in your newly generated dataset .py file, e.g. openimage.py (I recommend copying pascal_voc.py and working from there). Delete the -1.

Also delete the -1 in imdb.py: https://github.com/jwyang/faster-rcnn.pytorch/blob/358cecacf876717ff13988dc6396de10e265279c/lib/datasets/imdb.py#L121-L122

Some objects have bbox values such as 0,1,0, which either makes the code crash with an assertion error or makes the loss become nan. If your dataset has bbox coordinates that are 0 or equal to the image width, apply these changes.

hope it helps! 😃
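To illustrate why the -1 matters for boxes that touch the image border, here is a minimal sketch of a horizontal flip. It assumes the flip mirrors each x coordinate as width - x - 1, which is an assumption about the linked imdb.py lines; `flip_box` and its `offset` parameter are hypothetical names introduced only for this example.

```python
def flip_box(x1, x2, width, offset=1):
    """Mirror a box horizontally.

    offset=1 mimics the assumed "- 1" convention; offset=0 mimics the
    behavior after the "- 1" has been deleted.
    """
    new_x1 = width - x2 - offset
    new_x2 = width - x1 - offset
    return new_x1, new_x2


width = 1200
# A box whose xmax equals the image width.
with_offset = flip_box(100, 1200, width, offset=1)
without_offset = flip_box(100, 1200, width, offset=0)

print(with_offset)     # new_x1 is negative, so the flipped box is invalid
print(without_offset)  # new_x1 is 0, so the flipped box stays inside the image
```

A negative coordinate like this would either fail a validity check or, if stored in an unsigned array, wrap to a very large number.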

5 reactions

z-huabao commented, Aug 12, 2019

Make sure x2 stays within the image width and y2 within the image height, because the loader flips the image and its annotations:

        wh = tree.find('size')
        w, h = int(wh.find('width').text), int(wh.find('height').text)
        for ix, obj in enumerate(objs):
            bbox = obj.find('bndbox')
            # Read the coordinates without the -1 offset, then clamp them to the image
            x1 = float(bbox.find('xmin').text)
            y1 = float(bbox.find('ymin').text)
            x2 = float(bbox.find('xmax').text)
            y2 = float(bbox.find('ymax').text)
            x1 = max(x1, 0)
            y1 = max(y1, 0)
            x2 = min(x2, w)
            y2 = min(y2, h)
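The fragment above depends on `tree` and `objs` from the surrounding loader. A self-contained version of the same clamping idea might look like the sketch below. The sample XML, the `load_clamped_boxes` name, and the extra step that drops zero-area boxes are assumptions added for illustration and are not part of the original comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical Pascal VOC style annotation with out-of-range coordinates.
SAMPLE_XML = """
<annotation>
  <size><width>1200</width><height>600</height></size>
  <object>
    <name>person</name>
    <bndbox><xmin>0</xmin><ymin>-3</ymin><xmax>1205</xmax><ymax>600</ymax></bndbox>
  </object>
  <object>
    <name>dog</name>
    <bndbox><xmin>50</xmin><ymin>40</ymin><xmax>50</xmax><ymax>90</ymax></bndbox>
  </object>
</annotation>
"""


def load_clamped_boxes(xml_text):
    """Parse boxes, clamp them to the image bounds, and skip zero-area boxes."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    w, h = int(size.find("width").text), int(size.find("height").text)

    boxes = []
    for obj in root.findall("object"):
        bbox = obj.find("bndbox")
        x1 = max(float(bbox.find("xmin").text), 0.0)
        y1 = max(float(bbox.find("ymin").text), 0.0)
        x2 = min(float(bbox.find("xmax").text), float(w))
        y2 = min(float(bbox.find("ymax").text), float(h))
        # Assumption: a box with no width or height carries no training signal.
        if x2 > x1 and y2 > y1:
            boxes.append((x1, y1, x2, y2))
    return boxes


boxes = load_clamped_boxes(SAMPLE_XML)
print(boxes)
```

Running a check like this over every annotation file before training is a cheap way to find the images that would otherwise produce invalid boxes.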
