Trouble training custom dataset

Training Detectron on custom dataset

I’m trying to train Mask R-CNN on my custom dataset to perform a segmentation task on new classes that neither COCO nor ImageNet has ever seen.

I first converted my dataset to COCO format so that it can be loaded by pycocotools.
I added my dataset path to dataset_catalog.py and pointed it to the correct images directory and annotations file. The config file I used is based on configs/getting_started/tutorial_1gpu_e2e_faster_rcnn_R-50-FPN.yaml. My dataset contains only 4 classes (without background), so I set NUM_CLASSES to 5 (4 does not work either). I try to train using the command below:

python2 tools/train_net.py --cfg configs/encov/copy_maskrcnn_R-101-FPN.yaml OUTPUT_DIR /tmp/detectron-output/

ERROR 1:

I get the following error (the complete log file is here: output.txt):

At:
  /home/encov/Softwares/Detectron/lib/roi_data/fast_rcnn.py(269): _expand_bbox_targets
  /home/encov/Softwares/Detectron/lib/roi_data/fast_rcnn.py(181): _sample_rois
  /home/encov/Softwares/Detectron/lib/roi_data/fast_rcnn.py(112): add_fast_rcnn_blobs
  /home/encov/Softwares/Detectron/lib/ops/collect_and_distribute_fpn_rpn_proposals.py(62): forward
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what(): [enforce fail at pybind_state.h:423] . Exception encountered running PythonOp function: ValueError: could not broadcast input array from shape (4) into shape (0)

This error comes from the box-expansion procedure, which spreads each set of bounding box targets and weights across 4 columns per class (see roi_data/fast_rcnn.py). It takes the first element, which represents the class, checks that it is not 0 (the background), and copies the 4 values into the columns starting at class index x 4. The error happens because the class index falls outside the range allowed by the NUM_CLASSES parameter, which is used to size the output array.
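The failure described above can be reproduced outside Detectron with a few lines of NumPy. This is a simplified sketch of the expansion step, not the library's exact code: the function name `expand_bbox_targets` and the explicit `num_classes` argument are assumptions made for illustration (Detectron reads the value from its config).

```python
import numpy as np


def expand_bbox_targets(bbox_target_data, num_classes):
    """Simplified sketch of the expansion step (hypothetical standalone version).

    bbox_target_data has one row per RoI: (class, tx, ty, tw, th).
    The output has 4 columns per class, so its width is 4 * num_classes.
    """
    clss = bbox_target_data[:, 0]
    bbox_targets = np.zeros((clss.shape[0], 4 * num_classes), dtype=np.float32)
    bbox_inside_weights = np.zeros(bbox_targets.shape, dtype=np.float32)
    # Only foreground RoIs (class > 0) receive regression targets.
    for ind in np.where(clss > 0)[0]:
        cls = int(clss[ind])
        start = 4 * cls
        end = start + 4
        # If cls >= num_classes, this slice is empty and the assignment
        # raises a ValueError about broadcasting 4 values into 0 columns.
        bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]
        bbox_inside_weights[ind, start:end] = (1.0, 1.0, 1.0, 1.0)
    return bbox_targets, bbox_inside_weights


if __name__ == "__main__":
    rois = np.array([[5, 0.1, 0.2, 0.3, 0.4]], dtype=np.float32)
    try:
        expand_bbox_targets(rois, num_classes=5)
    except ValueError as err:
        print("class 5 with NUM_CLASSES=5 fails:", err)
```

With NUM_CLASSES set to 5, the valid class indices are 0 to 4, so a label of 5 in the annotations is enough to trigger the error.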

ERROR 2

I tried the same training with NUM_CLASSES set to 81, the number of classes used for COCO training (which works on my setup, by the way). The error I described above does not appear, but very early in the iterations some bounding box areas are zero, which causes divisions by zero. output2.txt

Has anyone experienced the same issue when training Fast R-CNN or Mask R-CNN on a custom dataset? I strongly suspect an error in my COCO-like JSON file, because training on the COCO dataset works correctly. Thank you for your help.

System information

Operating system: Ubuntu 16.04
Compiler version: GCC 5.4.0
CUDA version: 8.0
cuDNN version: 7.0
NVIDIA driver version: 384
GPU model: GeForce GTX 1080 (x1)
python --version output: Python 2.7.12

Issue Analytics

Created 6 years ago
Comments: 30

Top GitHub Comments

5 reactions

francoto commented, Mar 7, 2018

I finally got it working:

First, the bounding box coordinates in my dataset were wrong. I realized my mistake when I tried to visualize them using the pycocotools API (which, by the way, doesn’t have a specific method to show them by default).
Second, I misunderstood the part about needing a ‘background’ class (for labelling every pixel not in the other classes), so I added one to my dataset, but json_dataset.py actually creates its own. Deleting the ‘background’ label from my dataset finally allowed me to start the training.
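Both problems mentioned above (bad box coordinates and an explicit ‘background’ category) can be caught before training with a quick sanity check on the annotation file. The following is a minimal sketch using only the standard library; the helper name `find_annotation_problems` is hypothetical, and it assumes the usual COCO layout with `images`, `annotations`, and `categories` lists and boxes stored as [x, y, width, height].

```python
import json


def find_annotation_problems(coco):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = []

    # An explicit background category is suspicious, since the thread notes
    # that the loader adds its own background class.
    for cat in coco.get("categories", []):
        if cat["name"].lower() in ("background", "__background__"):
            problems.append(
                "category {!r} (id {}) looks like an explicit background class".format(
                    cat["name"], cat["id"]
                )
            )

    sizes = {img["id"]: (img["width"], img["height"]) for img in coco.get("images", [])}

    for ann in coco.get("annotations", []):
        x, y, w, h = ann["bbox"]  # assumed COCO convention: [x, y, width, height]
        if w <= 0 or h <= 0:
            problems.append("annotation {} has a zero or negative box size".format(ann["id"]))
            continue
        if ann["image_id"] not in sizes:
            problems.append("annotation {} refers to an unknown image".format(ann["id"]))
            continue
        img_w, img_h = sizes[ann["image_id"]]
        if x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            problems.append("annotation {} has a box outside its image".format(ann["id"]))

    return problems


def check_file(path):
    """Load a COCO-style JSON file from disk and report its problems."""
    with open(path) as f:
        return find_annotation_problems(json.load(f))
```

Running `check_file` on the annotations path registered in dataset_catalog.py and getting an empty list does not prove the file is correct, but a non-empty list points to boxes or categories worth inspecting first.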

4 reactions

realwecan commented, Feb 22, 2018

How many classes do you have in your custom dataset? If you have N classes, then you should set NUM_CLASSES: N+1 in your yaml config file. For example, for six classes you should set NUM_CLASSES: 7. For COCO, which has 80 classes, you should set it to 81.
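The N+1 rule can be derived directly from the annotation file rather than counted by hand. This is a small sketch under the assumption that the file follows the COCO layout with a `categories` list; the function name `num_classes_for_config` and the `__background__` label are illustrative, not taken from the thread.

```python
def num_classes_for_config(coco):
    """Return (NUM_CLASSES, class list, category-id-to-index map) for a COCO-style dict.

    Index 0 is reserved for the implicit background class, so the dataset's
    own categories are mapped to contiguous indices 1..N.
    """
    categories = sorted(coco["categories"], key=lambda c: c["id"])
    classes = ["__background__"] + [c["name"] for c in categories]
    id_to_index = {c["id"]: i + 1 for i, c in enumerate(categories)}
    return len(classes), classes, id_to_index


if __name__ == "__main__":
    example = {
        "categories": [
            {"id": 3, "name": "a"},
            {"id": 7, "name": "b"},
            {"id": 8, "name": "c"},
            {"id": 12, "name": "d"},
        ]
    }
    n, classes, mapping = num_classes_for_config(example)
    print("NUM_CLASSES:", n)
```

If the value printed here is one higher than expected, the annotation file probably already contains a background category of its own.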