Getting NaN loss during training
❓ Questions and Help
I am training a model modified from maskrcnn-benchmark. When the model is trained on a single GPU the loss behaves correctly, but when it is trained on 4 GPUs the loss very easily becomes NaN. How can I solve this problem?
Issue Analytics
- Created: 4 years ago
- Comments: 5 (1 by maintainers)
Top Results From Across the Web
Deep-Learning Nan loss reasons - python - Stack Overflow
You may have an issue with the input data. Try calling assert not np.any(np.isnan(x)) on the input data to make sure you are...
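As a concrete illustration of that suggestion, a minimal NaN/Inf check on an input batch might look like the sketch below; the check_batch helper and the sample array are illustrative, not from the issue.

```python
import numpy as np

def check_batch(x: np.ndarray) -> None:
    """Fail fast if a batch already contains NaNs or non-finite values."""
    assert not np.any(np.isnan(x)), "NaN found in input batch"
    assert np.all(np.isfinite(x)), "non-finite value found in input batch"

check_batch(np.array([[0.1, 0.2], [0.3, 0.4]]))  # passes silently for clean data
```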
Common Causes of NANs During Training
Common Causes of NANs During Training · Gradient blow up · Bad learning rate policy and params · Faulty Loss function · Faulty...
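For the "gradient blow up" cause in particular, gradient clipping is a common mitigation. Below is a minimal, generic PyTorch sketch; the toy model, data, and max_norm value are assumptions and not part of maskrcnn-benchmark.

```python
import torch

# Toy model and data purely for illustration.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 10), torch.randn(8, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# Cap the global gradient norm so one bad batch cannot push the weights to NaN.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```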
Getting NaN for loss - General Discussion - TensorFlow Forum
You transform X_train but pass X_train_A and X_train_B into the model, which were never transformed by the scaler and contain negative values.
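The fix in that case is to run every array the model sees through the same fitted scaler. A minimal scikit-learn sketch, with illustrative variable names:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
X_valid = rng.normal(size=(20, 4))

scaler = StandardScaler().fit(X_train)  # fit on the training split only
X_train = scaler.transform(X_train)     # ...then transform every array the model will see
X_valid = scaler.transform(X_valid)
```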
Debugging a Machine Learning model written in TensorFlow ...
In this article, you get to look over my shoulder as I go about debugging a ... a model that doesn't train, there...
Keras Sequential model returns loss 'nan'
@lcrmorin I'm pretty sure that my dataset doesn't contain nan elements. However, I noticed that the loss turned to nan when I changed...
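When it is unclear which change triggered the NaN, one option is to abort training as soon as the loss goes non-finite and bisect from there. A minimal sketch using Keras's built-in TerminateOnNaN callback; the toy model and data are placeholders:

```python
import tensorflow as tf

# Toy model and data purely for illustration.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")
x = tf.random.normal((32, 4))
y = tf.random.normal((32, 1))

# Abort the fit as soon as the loss becomes NaN, so the offending change is easy to bisect.
model.fit(x, y, epochs=2, callbacks=[tf.keras.callbacks.TerminateOnNaN()])
```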
Top GitHub Comments
I met the same problem. Trained with 4 GPUs, the loss was NaN from the first few iterations.
This problem was solved by setting the parameter WARMUP_ITERS to 1000.
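For reference, a minimal sketch of applying that setting, assuming the stock maskrcnn-benchmark yacs config (the config file path below is only an example). A longer learning-rate warmup gives the larger multi-GPU batch time to stabilize before the full learning rate kicks in.

```python
from maskrcnn_benchmark.config import cfg

cfg.merge_from_file("configs/e2e_mask_rcnn_R_50_FPN_1x.yaml")  # example config; use your own
cfg.merge_from_list(["SOLVER.WARMUP_ITERS", 1000])             # warm up for 1000 iterations instead of the default
cfg.freeze()
```

Setting SOLVER.WARMUP_ITERS directly in the YAML config file should work just as well.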