
Loss becomes NaN after a certain epoch

See original GitHub issue

Hi! I'm trying to train the tiny model on the Wireframe dataset (to reproduce the loss issue faster, I made the dataset smaller): [screenshot] After a certain step the loss becomes NaN. Debugging shows that one of the masks in weighted_bce_with_logits is all zeros, and the division by torch.sum(mask) returns NaN.

How to fix?

I'd appreciate any help.
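
For context, here is a minimal sketch of how an all-zero mask turns a masked BCE into 0/0 = NaN, and one common guard (clamping the denominator). This is an illustrative re-implementation under assumed shapes and signature, not the repository's actual weighted_bce_with_logits:

```python
import torch
import torch.nn.functional as F

def weighted_bce_with_logits(logits, target, mask, eps=1e-6):
    """Masked BCE-with-logits, averaged over the masked pixels.

    Illustrative only: if ``mask`` is all zeros, the naive
    ``loss.sum() / mask.sum()`` is 0/0 = NaN, so the denominator
    is clamped to ``eps`` to keep the loss finite.
    """
    per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    masked = per_pixel * mask
    # clamp(min=eps) avoids dividing by zero when no pixel is selected
    return masked.sum() / mask.sum().clamp(min=eps)

# The failure mode described in the issue: an all-zero mask.
logits = torch.randn(1, 1, 256, 256)
target = torch.zeros(1, 1, 256, 256)
mask = torch.zeros(1, 1, 256, 256)

naive = (F.binary_cross_entropy_with_logits(logits, target, reduction="none") * mask).sum() / mask.sum()
print(naive)                                           # tensor(nan)
print(weighted_bce_with_logits(logits, target, mask))  # tensor(0.)
```

Clamping only hides the symptom; as the comments below point out, the real problem here was that the ground-truth masks themselves were being generated and read incorrectly.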

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 16 (1 by maintainers)

Top GitHub Comments

16 reactions
kushnir95 commented, Jul 8, 2022

It looks like I found one of the reasons for this issue. Let's look at this piece of code in mlsd_pytorch/data/wireframe_dset.py: [code screenshot] Provided that the input size is 512x512, junction_map and line_map are (256, 256, 1) NumPy arrays. Accordingly, junction_map[0] and line_map[0] have shape (256, 1). Because NumPy broadcasts whenever it is necessary and possible, the code in lines 334-335 executes without errors, but the maps written to label[14, ...] and label[15, ...] are incorrect. One possible solution is to change junction_map[0] to junction_map[:, :, 0] and line_map[0] to line_map[:, :, 0].
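
A small self-contained NumPy sketch of the broadcasting trap described above; the array names and shapes mirror the comment, not the actual wireframe_dset.py code:

```python
import numpy as np

# Shapes mirror the comment above (illustrative only): a (256, 256, 1)
# ground-truth map and a label tensor whose channels are (256, 256) slices.
junction_map = np.random.rand(256, 256, 1)
label = np.zeros((16, 256, 256), dtype=np.float32)

# Buggy indexing: junction_map[0] is the first ROW, shape (256, 1).
# NumPy silently broadcasts it across the (256, 256) slice, so every
# column of label[14] holds the same values -- no error, wrong map.
label[14, ...] = junction_map[0]
print(junction_map[0].shape)                     # (256, 1)
print(np.allclose(label[14], label[14][:, :1]))  # True: all columns identical

# Fixed indexing: select the single CHANNEL explicitly.
label[14, ...] = junction_map[:, :, 0]
print(junction_map[:, :, 0].shape)               # (256, 256)
```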

1 reaction
michelebechini commented, Jul 6, 2022

> hi @michelebechini
>
> > The error is in the ground-truth mask of the line segmentation, which becomes all zeros after a few epochs when it is used to compute the line segmentation loss. It is not related to the learning rate; I ran a training run excluding the line segmentation loss from the loop, and it seems to work fine.
>
> Did you solve this issue? I trained the model on the Wireframe dataset, but the result was not good; sAP_10 was stuck at ~30 (6x in the paper). How can I improve this training process? Thanks

I didn't solve the issue, because for me too the results with this PyTorch implementation are not as good as in the original paper. The issue is NOT related to the learning rate; it is simply caused by a bug in how the computed ground-truth masks are read. Also note that, the last time I checked, the loss functions had some bugs as well (no matching loss), which can strongly affect the final sAP_10.
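
Since the reported root cause is in the ground-truth masks rather than in the optimizer, a dataset-level sanity check can confirm it before training. The sketch below assumes a PyTorch dataset yielding (image, label) pairs with the line-segmentation ground truth at channel 15, following the label layout mentioned above; the channel index, function name, and dataset interface are all assumptions for illustration:

```python
import torch
from torch.utils.data import DataLoader

def count_empty_line_masks(dataset, line_channel=15, batch_size=8):
    """Count samples whose line-segmentation ground truth is all zeros
    (the condition that makes the masked loss evaluate to 0/0).

    ``line_channel=15`` follows the label layout described above; both
    the channel index and the (image, label) interface are assumptions.
    """
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    empty, total = 0, 0
    for _, label in loader:                      # assumes (image, label) pairs
        line_mask = label[:, line_channel, ...]  # (B, H, W) line ground truth
        empty += (line_mask.flatten(1).sum(dim=1) == 0).sum().item()
        total += line_mask.shape[0]
    return empty, total

# Example usage:
#   empty, total = count_empty_line_masks(train_dataset)
#   print(f"{empty}/{total} samples have an all-zero line mask")
```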

Read more comments on GitHub >

Top Results From Across the Web

  • Neural Network after first epoch generates NaN values as ...
    problem is after I am running it the output and loss function are turning into NaN value: epoch: 0, optimizer: None, loss: inf; ...
  • Cost function turning into nan after a certain number of iterations
    Your input contains nan (or unexpected values); Loss function not implemented properly; Numerical instability in the Deep learning framework.
  • loss becomes nan after 1st epoch while training 5 folds. #766
    1st fold ran successfully but loss became nan at the 2nd epoch of the 2nd fold. The problem is 1457 train images because ...
  • Common Causes of NANs During Training
    Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears ...
  • Training data becomes nan after several epochs - vision
    I would recommend you to check if the loss is inf or is NaN. If you backpropagate a NaN or an Inf, all ...
