"nan" loss on training
Hi!
Thanks for releasing the library. I’m encountering “nan” loss on training with the following commit, which I think is the most recent version: 60f35edc52862109555f4acf66236becc29705ad
Here are instructions to reproduce:
pip install -r ./requirements.txt
bash ./scripts/download_ud_data.sh
python train.py --config config/ud/en/udify_bert_train_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/
The end of the training output is this:
2020-06-10 16:23:38,177 - INFO - allennlp.training.trainer - Training
0%|          | 0/392 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 110, in <module>
    train_model(train_params, serialization_dir, recover=bool(args.resume))
  File "/home/gneubig/anaconda3/envs/python3/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/home/gneubig/anaconda3/envs/python3/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/home/gneubig/anaconda3/envs/python3/lib/python3.7/site-packages/allennlp/training/trainer.py", line 323, in _train_epoch
    raise ValueError("nan loss encountered")
ValueError: nan loss encountered
I’ve attached the full log below as well: udify-log.txt
My pip environment is also here: pip-list.txt
Do you have an idea what the issue is? I’d be happy to help debug further (cc: @antonisa and @LeYonan)
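One quick check that turns out to matter in the comments below is which versions of the core packages actually got installed. A minimal sketch, assuming a plain pip environment (the package names are just the ones discussed in this thread):
# Print the installed versions of the packages discussed in this thread
pip freeze | grep -iE "^(torch|allennlp|conllu|gevent)"
python -c "import torch; print(torch.__version__)"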
Thanks for looking into it @dfvalio.
I tried with the conllu version that gets installed with allennlp 0.9.0 (conllu==1.3.1), and also conllu==2.3.2, and even the current version conllu==4.4, and training runs with all three versions for me.
I ran a diff on the packages which are installed by requirements.txt and a working environment I had, and I started changing packages to those in my working environment until I could launch the training command. The first time it worked was when I changed the gevent package:
After changing this package version the training script runs fine. Can you confirm this works for you as well @Hyperparticle @dfvalio? To reproduce:
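The diff-the-environments approach described above can be sketched roughly as follows. This is a minimal sketch, assuming both environments are plain pip virtualenvs; the file names are placeholders, it is not the exact repro the commenter posted, and it does not pin the specific gevent version that fixed things:
# In the environment created from requirements.txt (the failing one)
pip freeze > requirements-env.txt
# In the environment where training works
pip freeze > working-env.txt
# Compare the two, then pin packages (e.g. gevent) to the working versions one at a time
diff requirements-env.txt working-env.txt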
Hi @neubig, @Hyperparticle,
I thought this might have something to do with some unexpected behaviour introduced by the recent PR #13, but I cloned a version of udify (c277adea9295b05772e3b508d05ce13aea8bde03) from before this PR was merged and I still get a nan loss almost immediately after commencing training.
It is just a hunch, but when allennlp==0.9.0 is installed it looks for torch >=1.2.0, which installs version 1.5.0, which might be too recent a version. I have some local environments where udify can train successfully, and I am able to train in a fresh environment if I install those requirements instead:
This is now not producing nan losses for me, so it might be something to do with the versions of the software which are installed in requirements.txt. Oddly enough, I found installing these requirements to work too.
The requirements I used are here: reqs.txt
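As a rough illustration of the torch hunch above: pinning torch below 1.5.0 while keeping allennlp==0.9.0 would look like the line below. The exact versions that worked are in the attached reqs.txt; the <1.5.0 constraint here is only an assumption drawn from the comment, not a verified fix:
pip install "allennlp==0.9.0" "torch>=1.2.0,<1.5.0"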