
"nan" loss on training

See original GitHub issue

Hi!

Thanks for releasing the library. I’m encountering “nan” loss on training with the following commit, which I think is the most recent version: 60f35edc52862109555f4acf66236becc29705ad

Here are instructions to reproduce:

pip install -r ./requirements.txt
bash ./scripts/download_ud_data.sh
python train.py --config config/ud/en/udify_bert_train_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/

The end of the training log looks like this:

2020-06-10 16:23:38,177 - INFO - allennlp.training.trainer - Training
  0%|          | 0/392 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 110, in <module>
    train_model(train_params, serialization_dir, recover=bool(args.resume))
  File "/home/gneubig/anaconda3/envs/python3/lib/python3.7/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/home/gneubig/anaconda3/envs/python3/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/home/gneubig/anaconda3/envs/python3/lib/python3.7/site-packages/allennlp/training/trainer.py", line 323, in _train_epoch
    raise ValueError("nan loss encountered")
ValueError: nan loss encountered

I’ve attached the full log below as well: udify-log.txt

My pip environment is also here: pip-list.txt

Do you have an idea what the issue is? I’d be happy to help debug further (cc: @antonisa and @LeYonan)
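
For anyone trying to narrow this down, one generic way to localize the first non-finite value is to turn on PyTorch's anomaly detection and fail fast on NaN/Inf tensors. This is only a minimal sketch on a toy model, not the actual UDify/allennlp training loop; model, inputs, and targets are placeholders:

import torch
import torch.nn as nn

# Make the backward pass report which forward op produced a NaN/Inf
# (noticeably slower; enable only while debugging).
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    """Fail fast with a descriptive message if a tensor contains NaN/Inf."""
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"non-finite values detected in {name}")

# Toy stand-in for a single training step.
model = nn.Linear(4, 1)
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

loss = nn.functional.mse_loss(model(inputs), targets)
check_finite("loss", loss)  # in a real loop this would trip before "nan loss encountered"
loss.backward()
for name, param in model.named_parameters():
    check_finite(f"grad of {name}", param.grad)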

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

4 reactions
jbrry commented, Feb 18, 2021

Thanks for looking into it @dfvalio.

"Fixed my problem with the requirement specified by you. CONLLU must remain at 1.3.1"

I tried the conllu version that gets installed with allennlp 0.9.0 (conllu==1.3.1), as well as conllu==2.3.2 and even the current version, conllu==4.4, and training runs with all three versions for me.

I ran a diff between the packages installed by requirements.txt and a working environment I had, then changed packages to the versions from my working environment until I could launch the training command. The first time it worked was when I changed the gevent package:

pip install gevent==1.4.0

After changing this package version, the training script runs fine. Can you confirm this works for you as well, @Hyperparticle @dfvalio? To reproduce:

conda create -n udify_install python=3.7
conda activate udify_install
pip install -r requirements.txt 

# BREAKS
python train.py --config config/ud/en/udify_bert_finetune_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/ # doesn't work

pip install gevent==1.4.0

# WORKS
python train.py --config config/ud/en/udify_bert_finetune_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/
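
To sanity-check that the environment really ended up with the intended pins (gevent in particular), something like the following can be run inside the same conda environment; the package list is just the ones discussed in this thread:

# Print the versions actually installed in the active environment.
# pkg_resources ships with setuptools and works on Python 3.7.
import pkg_resources

for pkg in ("torch", "allennlp", "conllu", "gevent"):
    try:
        print(f"{pkg}=={pkg_resources.get_distribution(pkg).version}")
    except pkg_resources.DistributionNotFound:
        print(f"{pkg}: not installed")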

2 reactions
jbrry commented, Jun 13, 2020

Hi @neubig, @Hyperparticle,

I thought this might have something to do with some unexpected behaviour introduced by the recent PR #13, but I cloned a version of udify from before this PR was merged (c277adea9295b05772e3b508d05ce13aea8bde03) and I still get a nan loss almost immediately after training starts.

It is just a hunch, but when allennlp==0.9.0 is installed it requires torch>=1.2.0, which installs version 1.5.0, and that might be too recent a version.

Collecting torch>=1.2.0
  Using cached torch-1.5.0-cp37-cp37m-manylinux1_x86_64.whl (752.0 MB)

I have some local environments where udify can train successfully, and I am able to train in a fresh environment if I install the requirements from one of those environments instead:

# put versions of libraries from a working environment into `reqs.txt`
pip freeze > reqs.txt

conda create -n udify_alternative_reqs python=3.7
conda activate udify_alternative_reqs
pip install -r reqs.txt

python train.py --config config/ud/en/udify_bert_finetune_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/

This is now not producing nan losses for me, so it might be something to do with the versions of the software installed by requirements.txt. Oddly enough, I found installing these requirements to work too.

The requirements I used are here: reqs.txt
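
The "diff the environments" step described earlier in this thread can also be scripted. A small sketch, assuming the repository's requirements.txt and a pip freeze dump of a working environment saved as reqs.txt (file names are illustrative):

# Compare two pinned requirement lists and print packages whose versions differ.
def parse_pins(path):
    pins = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "==" in line:
                name, version = line.split("==", 1)
                pins[name.lower()] = version
    return pins

broken = parse_pins("requirements.txt")  # environment that hits the nan loss
working = parse_pins("reqs.txt")         # environment that trains fine

for name in sorted(set(broken) | set(working)):
    if broken.get(name) != working.get(name):
        print(f"{name}: broken={broken.get(name, '-')}  working={working.get(name, '-')}")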

Read more comments on GitHub >

Top Results From Across the Web

  • Deep-Learning Nan loss reasons - python - Stack Overflow
  • Common Causes of NANs During Training (gradient blow up, bad learning rate policy and params, faulty loss function, ...)
  • Debugging a Machine Learning model written in TensorFlow ...
  • Getting NaN for loss - General Discussion - TensorFlow Forum
  • 'Training loss' gets 'nan' when training deeplearning model
