wmt_en_de admin: Function 'SoftmaxBackward' returned nan values in its 0th output.
I was wondering if you ever encountered NaN gradients during admin training. I’m on torch 1.6 / CUDA 10.1 with no modifications to the code:
Command
export dd=data-bin/wmt14_en_de_joined_dict
GPUS=0,1,2,3
GPUID=1
TOKEN_NUMBER=8192
UPDATE_FREQUENCE=1
for lnum in 18
do
CUDA_VISIBLE_DEVICES=$GPUID fairseq-train \
$dd -s en -t de \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --max-update 500000 \
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
--max-tokens $TOKEN_NUMBER --update-freq $UPDATE_FREQUENCE \
--save-dir wmt14ende/wmt-admin-${lnum}l --restore-file x.pt --seed 1111 \
--user-dir ../radam_fairseq --log-format simple --log-interval 500 \
--init-type adaptive-profiling --fp16 --fp16-scale-window 256 \
--encoder-layers $lnum --decoder-layers $lnum \
--threshold-loss-scale 0.03125
CUDA_VISIBLE_DEVICES=$GPUS fairseq-train \
$dd -s en -t de \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --max-update 500000 \
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
--max-tokens $TOKEN_NUMBER --update-freq $UPDATE_FREQUENCE \
--save-dir wmt14ende/wmt-admin-${lnum}l --restore-file x.pt --seed 1111 \
--user-dir ../radam_fairseq --log-format simple --log-interval 500 \
--init-type adaptive --fp16 --fp16-scale-window 256 \
--encoder-layers $lnum --decoder-layers $lnum \
--threshold-loss-scale 0.03125 | tee ./wmt14ende/log/loss_admin-${lnum}l.log
bash eval_wmt_en-de.sh wmt14ende/wmt-admin-${lnum}l $GPUID
done
The profiling command works fine, but the second command raises:
Traceback
| WARNING: overflow detected, setting loss scale to: 32.0
| epoch 002 | loss 4.937 | nll_loss 3.371 | ppl 10.34 | wps 24011 | ups 1 | wpb 28913.466 | bsz 942.984 | num_updates 9352 | lr 0.000924896 | gnorm 0.368 | clip 0.000 | oom 0.000 | loss_scale 32.000 | wall 228 | train_wall 226
Traceback (most recent call last):
File "/private/home/sshleifer/.conda/envs/clinic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq_cli/train.py", line 307, in distributed_main
main(args, init_distributed=True)
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq_cli/train.py", line 90, in main
train(args, trainer, task, epoch_itr)
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq_cli/train.py", line 139, in train
log_output = trainer.train_step(samples)
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq/trainer.py", line 349, in train_step
raise e
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq/trainer.py", line 311, in train_step
loss, sample_size, logging_output = self.task.train_step(
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq/tasks/fairseq_task.py", line 264, in train_step
optimizer.backward(loss)
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq/optim/fp16_optimizer.py", line 103, in backward
loss.backward()
File "/private/home/sshleifer/.conda/envs/clinic/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/private/home/sshleifer/.conda/envs/clinic/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
Variable._execution_engine.run_backward(
RuntimeError: Function 'SoftmaxBackward' returned nan values in its 0th output.
Contents of profile_ratio.init: https://gist.github.com/sshleifer/b615558499b9b10bd5bee8ddf2db030a
Data directory:
I would add
python 3.6, torch 1.5 or torch 1.6
to the README. I think with those versions, plus some guidance that training takes a really long time, it will make sense. The logs were really helpful; I think I am getting similar results now.

This is weird: all the admin models on wmt14ende failed in your setting. I compared your logs and my logs, and their development ppls are almost the same. It seems the training just shut down for some unknown reason…
One random guess is half-precision training (since it should detect NaN gradients and adjust the loss scaling accordingly). Maybe you can load the last checkpoint and see whether full-precision training can avoid the error? A sketch of that check is below.
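A minimal sketch of that check, not something verified on this exact setup: it assumes fairseq wrote a checkpoint_last.pt into the --save-dir of the failed run and that the variables from the script above ($dd, $GPUS, $TOKEN_NUMBER, $UPDATE_FREQUENCE) are still set. It mirrors the original second command with the fp16-related flags (--fp16, --fp16-scale-window, --threshold-loss-scale) dropped:

# assumes wmt14ende/wmt-admin-18l/checkpoint_last.pt exists; add --reset-optimizer
# if the optimizer state saved by the fp16 run refuses to load in full precision
lnum=18
CUDA_VISIBLE_DEVICES=$GPUS fairseq-train \
$dd -s en -t de \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --max-update 500000 \
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
--max-tokens $TOKEN_NUMBER --update-freq $UPDATE_FREQUENCE \
--save-dir wmt14ende/wmt-admin-${lnum}l --restore-file checkpoint_last.pt --seed 1111 \
--user-dir ../radam_fairseq --log-format simple --log-interval 500 \
--init-type adaptive \
--encoder-layers $lnum --decoder-layers $lnum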
Another random guess is the distributed training. Maybe you can load the last checkpoint and see whether a one-GPU setting can avoid the error (with UPDATE_FREQUENCE=4)?
Also, it could be caused by OOM. Maybe you can load the last checkpoint and see whether halving the batch size and setting UPDATE_FREQUENCE=2 can avoid the error? Sketches of both variants follow below.
(Sorry, I’m fully occupied with job applications this week and cannot run experiments with the torch 1.6 / CUDA 10.1 setup…)
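Minimal sketches of those two variants, under the same assumptions as the full-precision sketch above (a checkpoint_last.pt exists under the --save-dir, and the variables from the original script are still set). The shared flags go into a bash array so that only the device list, --max-tokens and --update-freq change between variants:

lnum=18
# shared flags: identical to the original second command, except that
# --restore-file points at the (assumed) checkpoint_last.pt instead of x.pt
common_flags=(
"$dd" -s en -t de
--arch transformer_wmt_en_de --share-all-embeddings
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0
--lr-scheduler inverse_sqrt --max-update 500000
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09
--criterion label_smoothed_cross_entropy --label-smoothing 0.1
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1
--save-dir wmt14ende/wmt-admin-${lnum}l --restore-file checkpoint_last.pt --seed 1111
--user-dir ../radam_fairseq --log-format simple --log-interval 500
--init-type adaptive --fp16 --fp16-scale-window 256
--encoder-layers $lnum --decoder-layers $lnum
--threshold-loss-scale 0.03125
)

# variant 1: one GPU; --update-freq 4 keeps the effective batch of the 4-GPU run
CUDA_VISIBLE_DEVICES=$GPUID fairseq-train "${common_flags[@]}" \
--max-tokens $TOKEN_NUMBER --update-freq 4

# variant 2: all 4 GPUs, half the per-step batch, --update-freq 2 (to rule out OOM)
CUDA_VISIBLE_DEVICES=$GPUS fairseq-train "${common_flags[@]}" \
--max-tokens $((TOKEN_NUMBER / 2)) --update-freq 2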