
Loss becomes NaN after a certain epoch

See original GitHub issue

Hi! I'm trying to train the tiny model on the Wireframe dataset (to reproduce the loss issue faster, I made the dataset smaller): [screenshot] After a certain step the loss becomes NaN. Debugging shows that one of the masks in weighted_bce_with_logits is all zeros, and the division by torch.sum(mask) returns NaN.

How to fix?

I'd appreciate any help.
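
For context, here is a minimal sketch of how an all-zero mask turns a masked BCE into 0/0 = NaN, and one common guard (clamping the denominator). This is an illustrative re-implementation under assumed shapes and signature, not the repository's actual weighted_bce_with_logits:

```python
import torch
import torch.nn.functional as F

def weighted_bce_with_logits(logits, target, mask, eps=1e-6):
    """Masked BCE-with-logits, averaged over the masked pixels.

    Illustrative only: if ``mask`` is all zeros, the naive
    ``loss.sum() / mask.sum()`` is 0/0 = NaN, so the denominator
    is clamped to ``eps`` to keep the loss finite.
    """
    per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    masked = per_pixel * mask
    # clamp(min=eps) avoids dividing by zero when no pixel is selected
    return masked.sum() / mask.sum().clamp(min=eps)

# The failure mode described in the issue: an all-zero mask.
logits = torch.randn(1, 1, 256, 256)
target = torch.zeros(1, 1, 256, 256)
mask = torch.zeros(1, 1, 256, 256)

naive = (F.binary_cross_entropy_with_logits(logits, target, reduction="none") * mask).sum() / mask.sum()
print(naive)                                           # tensor(nan)
print(weighted_bce_with_logits(logits, target, mask))  # tensor(0.)
```

Clamping only hides the symptom; as the comments below point out, the real problem here was that the ground-truth masks themselves were being generated and read incorrectly.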

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 16 (1 by maintainers)

Top GitHub Comments

16 reactions
kushnir95 commented, Jul 8, 2022

It looks like I found one of the reasons for this issue. Let's look at this piece of code in mlsd_pytorch/data/wireframe_dset.py: [code screenshot] Provided that the input size is 512x512, junction_map and line_map are (256, 256, 1) NumPy arrays. Accordingly, junction_map[0] and line_map[0] have shape (256, 1). Because NumPy broadcasts whenever it is necessary and possible, the code in lines 334-335 executes without errors, but the maps written to label[14, ...] and label[15, ...] are incorrect. One possible solution is to change junction_map[0] to junction_map[:, :, 0] and line_map[0] to line_map[:, :, 0].
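
A small self-contained NumPy sketch of the broadcasting trap described above; the array names and shapes mirror the comment, not the actual wireframe_dset.py code:

```python
import numpy as np

# Shapes mirror the comment above (illustrative only): a (256, 256, 1)
# ground-truth map and a label tensor whose channels are (256, 256) slices.
junction_map = np.random.rand(256, 256, 1)
label = np.zeros((16, 256, 256), dtype=np.float32)

# Buggy indexing: junction_map[0] is the first ROW, shape (256, 1).
# NumPy silently broadcasts it across the (256, 256) slice, so every
# column of label[14] holds the same values -- no error, wrong map.
label[14, ...] = junction_map[0]
print(junction_map[0].shape)                     # (256, 1)
print(np.allclose(label[14], label[14][:, :1]))  # True: all columns identical

# Fixed indexing: select the single CHANNEL explicitly.
label[14, ...] = junction_map[:, :, 0]
print(junction_map[:, :, 0].shape)               # (256, 256)
```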

1 reaction
michelebechini commented, Jul 6, 2022

> hi @michelebechini
>
> > The error is in the ground-truth mask of the line segmentation, which becomes all zeros after a few epochs when it is used to compute the line segmentation loss. It is not related to the learning rate; I ran a training run excluding the line segmentation loss from the loop, and it seems to work fine.
>
> Did you solve this issue? I trained the model on the Wireframe dataset, but the result was not good; sAP_10 was stuck at ~30 (6x in the paper). How can I improve this training process? Thanks

I didn't solve the issue, because for me too the results with this PyTorch implementation are not as good as in the original paper. The issue is NOT related to the learning rate; it is simply caused by a bug in how the computed ground-truth masks are read. Also note that, the last time I checked, the loss functions had some bugs as well (no matching loss), which can strongly affect the final sAP_10.
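
Since the reported root cause is in the ground-truth masks rather than in the optimizer, a dataset-level sanity check can confirm it before training. The sketch below assumes a PyTorch dataset yielding (image, label) pairs with the line-segmentation ground truth at channel 15, following the label layout mentioned above; the channel index, function name, and dataset interface are all assumptions for illustration:

```python
import torch
from torch.utils.data import DataLoader

def count_empty_line_masks(dataset, line_channel=15, batch_size=8):
    """Count samples whose line-segmentation ground truth is all zeros
    (the condition that makes the masked loss evaluate to 0/0).

    ``line_channel=15`` follows the label layout described above; both
    the channel index and the (image, label) interface are assumptions.
    """
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    empty, total = 0, 0
    for _, label in loader:                      # assumes (image, label) pairs
        line_mask = label[:, line_channel, ...]  # (B, H, W) line ground truth
        empty += (line_mask.flatten(1).sum(dim=1) == 0).sum().item()
        total += line_mask.shape[0]
    return empty, total

# Example usage:
#   empty, total = count_empty_line_masks(train_dataset)
#   print(f"{empty}/{total} samples have an all-zero line mask")
```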

Read more comments on GitHub >

Top Results From Across the Web

  • Neural Network after first epoch generates NaN values as ...
    problem is after I am running it the output and loss function are turning into NaN value: epoch: 0, optimizer: None, loss: inf; ...
  • Cost function turning into nan after a certain number of iterations
    Your input contains nan (or unexpected values); Loss function not implemented properly; Numerical instability in the Deep learning framework.
  • loss becomes nan after 1st epoch while training 5 folds. #766
    1st fold ran successfully but loss became nan at the 2nd epoch of the 2nd fold. The problem is 1457 train images because ...
  • Common Causes of NANs During Training
    Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears ...
  • Training data becomes nan after several epochs - vision
    I would recommend you to check if the loss is inf or is NaN. If you backpropagate a NaN or an Inf, all ...
