Getting NaN loss during training
❓ Questions and Help
I am training a model modified from maskrcnn-benchmark. When the model is trained on a single GPU the loss behaves correctly, but when it is trained on 4 GPUs the loss very easily becomes NaN. How can I solve this problem?
Issue Analytics
- Created: 4 years ago
- Comments: 5 (1 by maintainers)
Top Results From Across the Web
Deep-Learning Nan loss reasons - python - Stack Overflow
You may have an issue with the input data. Try calling assert not np.any(np.isnan(x)) on the input data to make sure you are...
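As a concrete illustration of that suggestion, a minimal NaN/Inf check on an input batch might look like the sketch below; the check_batch helper and the sample array are illustrative, not from the issue.

```python
import numpy as np

def check_batch(x: np.ndarray) -> None:
    """Fail fast if a batch already contains NaNs or non-finite values."""
    assert not np.any(np.isnan(x)), "NaN found in input batch"
    assert np.all(np.isfinite(x)), "non-finite value found in input batch"

check_batch(np.array([[0.1, 0.2], [0.3, 0.4]]))  # passes silently for clean data
```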
Common Causes of NANs During Training
Common Causes of NANs During Training · Gradient blow up · Bad learning rate policy and params · Faulty Loss function · Faulty...
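For the "gradient blow up" cause in particular, gradient clipping is a common mitigation. Below is a minimal, generic PyTorch sketch; the toy model, data, and max_norm value are assumptions and not part of maskrcnn-benchmark.

```python
import torch

# Toy model and data purely for illustration.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 10), torch.randn(8, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# Cap the global gradient norm so one bad batch cannot push the weights to NaN.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```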
Getting NaN for loss - General Discussion - TensorFlow Forum
You transform X_train but pass X_train_A and X_train_B into the model, which were never transformed by the scaler and contain negative values.
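The fix in that case is to run every array the model sees through the same fitted scaler. A minimal scikit-learn sketch, with illustrative variable names:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
X_valid = rng.normal(size=(20, 4))

scaler = StandardScaler().fit(X_train)  # fit on the training split only
X_train = scaler.transform(X_train)     # ...then transform every array the model will see
X_valid = scaler.transform(X_valid)
```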
Debugging a Machine Learning model written in TensorFlow ...
In this article, you get to look over my shoulder as I go about debugging a ... a model that doesn't train, there...
Keras Sequential model returns loss 'nan'
@lcrmorin I'm pretty sure that my dataset doesn't contain nan elements. However, I noticed that the loss turned to nan when I changed...
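When it is unclear which change triggered the NaN, one option is to abort training as soon as the loss goes non-finite and bisect from there. A minimal sketch using Keras's built-in TerminateOnNaN callback; the toy model and data are placeholders:

```python
import tensorflow as tf

# Toy model and data purely for illustration.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")
x = tf.random.normal((32, 4))
y = tf.random.normal((32, 1))

# Abort the fit as soon as the loss becomes NaN, so the offending change is easy to bisect.
model.fit(x, y, epochs=2, callbacks=[tf.keras.callbacks.TerminateOnNaN()])
```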
Top GitHub Comments
I met the same problem. Trained with 4 GPUs, the loss was NaN from the first few iterations.
This problem was solved by setting the parameter WARMUP_ITERS to 1000.
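For reference, a minimal sketch of applying that setting, assuming the stock maskrcnn-benchmark yacs config (the config file path below is only an example). A longer learning-rate warmup gives the larger multi-GPU batch time to stabilize before the full learning rate kicks in.

```python
from maskrcnn_benchmark.config import cfg

cfg.merge_from_file("configs/e2e_mask_rcnn_R_50_FPN_1x.yaml")  # example config; use your own
cfg.merge_from_list(["SOLVER.WARMUP_ITERS", 1000])             # warm up for 1000 iterations instead of the default
cfg.freeze()
```

Setting SOLVER.WARMUP_ITERS directly in the YAML config file should work just as well.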