Loss becomes NaN after a certain epoch.
Hi!
I'm trying to train the tiny model on the wireframe dataset (with the dataset made smaller to reproduce the loss issue faster).
After a certain step the loss becomes NaN. Debugging shows that some mask in `weighted_bce_with_logits` is all zeros, so the division by `torch.sum(mask)` returns NaN.
How can this be fixed? I'd appreciate any help.
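For reference, here is a minimal, self-contained sketch of the failure mode. `weighted_bce_with_logits_sketch` below is a simplified stand-in for illustration, not the repository's actual implementation, and the clamp-based guard is just one common way to avoid the 0/0 division:

```python
import torch
import torch.nn.functional as F

def weighted_bce_with_logits_sketch(logits, target, mask):
    # Simplified stand-in: per-pixel BCE, masked, then normalized by the mask size.
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    # If the mask is all zeros, numerator and denominator are both 0 -> 0/0 = NaN.
    return (loss * mask).sum() / mask.sum()

def weighted_bce_with_logits_guarded(logits, target, mask, eps=1e-6):
    # Same computation, but the denominator is clamped so an empty mask
    # contributes zero loss (and zero gradient) instead of NaN.
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp_min(eps)

logits = torch.zeros(2, 1, 8, 8)
target = torch.zeros(2, 1, 8, 8)
mask = torch.zeros(2, 1, 8, 8)  # fully empty mask, as observed in the debugger

print(weighted_bce_with_logits_sketch(logits, target, mask))   # tensor(nan)
print(weighted_bce_with_logits_guarded(logits, target, mask))  # tensor(0.)
```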
Issue Analytics
- State:
- Created 2 years ago
- Comments: 16 (1 by maintainers)
It looks like I found one of the causes of this issue. Let's look at the relevant code in mlsd_pytorch/data/wireframe_dset.py.
Provided that the input size is 512x512, `junction_map` and `line_map` are (256, 256, 1) NumPy arrays. Accordingly, `junction_map[0]` and `line_map[0]` have shape (256, 1). Since NumPy broadcasts whenever it is necessary and possible, the code in rows 334-335 executes without errors, but the maps stored in `label[14, ...]` and `label[15, ...]` are incorrect. One possible fix is to change `junction_map[0]` to `junction_map[:, :, 0]` and `line_map[0]` to `line_map[:, :, 0]`.
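To make the pitfall concrete, here is a small self-contained sketch; the 16-channel `label` layout and the map contents are assumptions for illustration only, not the repository's exact code:

```python
import numpy as np

H = W = 256
# Hypothetical stand-ins for the maps built in wireframe_dset.py.
junction_map = np.zeros((H, W, 1), dtype=np.float32)
junction_map[10, 20, 0] = 1.0            # a single junction pixel at row 10, col 20

label = np.zeros((16, H, W), dtype=np.float32)  # assumed channels-first label layout

# Buggy indexing: junction_map[0] is the first *row*, shape (256, 1),
# and NumPy silently broadcasts it across all 256 columns of label[14].
label[14, ...] = junction_map[0]
print(label[14].sum())                   # 0.0 -- the junction at (10, 20) is lost

# Fixed indexing: junction_map[:, :, 0] has shape (256, 256), copied as-is.
label[14, ...] = junction_map[:, :, 0]
print(label[14].sum())                   # 1.0
print(label[14, 10, 20])                 # 1.0
```

With the buggy indexing, whatever happens to be in the first row of the map is tiled over the whole channel, so most ground-truth junction/line pixels are simply lost; that also makes it plausible for `weighted_bce_with_logits` to end up with an all-zero mask and hit the 0/0 NaN described above.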
I didn't solve the issue, because for me too the results with this PyTorch implementation are not as good as in the original paper. The issue is NOT related to the learning rate; it is simply caused by a bug in how the computed ground-truth masks are read. Moreover, note that the last time I tried, the loss functions also had some bugs (no matching loss), and this can strongly affect the final sAP_10.