wmt_en_de admin: Function 'SoftmaxBackward' returned nan values in its 0th output.
I was wondering if you ever encountered NaN gradients during admin training. I’m on torch 1.6 / CUDA 10.1 with no modifications to the code:
Command
export dd=data-bin/wmt14_en_de_joined_dict
GPUS=0,1,2,3
GPUID=1
TOKEN_NUMBER=8192
UPDATE_FREQUENCE=1
for lnum in 18
do
CUDA_VISIBLE_DEVICES=$GPUID fairseq-train \
$dd -s en -t de \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --max-update 500000 \
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
--max-tokens $TOKEN_NUMBER --update-freq $UPDATE_FREQUENCE \
--save-dir wmt14ende/wmt-admin-${lnum}l --restore-file x.pt --seed 1111 \
--user-dir ../radam_fairseq --log-format simple --log-interval 500 \
--init-type adaptive-profiling --fp16 --fp16-scale-window 256 \
--encoder-layers $lnum --decoder-layers $lnum \
--threshold-loss-scale 0.03125
CUDA_VISIBLE_DEVICES=$GPUS fairseq-train \
$dd -s en -t de \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --max-update 500000 \
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
--max-tokens $TOKEN_NUMBER --update-freq $UPDATE_FREQUENCE \
--save-dir wmt14ende/wmt-admin-${lnum}l --restore-file x.pt --seed 1111 \
--user-dir ../radam_fairseq --log-format simple --log-interval 500 \
--init-type adaptive --fp16 --fp16-scale-window 256 \
--encoder-layers $lnum --decoder-layers $lnum \
--threshold-loss-scale 0.03125 | tee ./wmt14ende/log/loss_admin-${lnum}l.log
bash eval_wmt_en-de.sh wmt14ende/wmt-admin-${lnum}l $GPUID
done
The profiling command works fine, but the second command raises:
Traceback
| WARNING: overflow detected, setting loss scale to: 32.0
| epoch 002 | loss 4.937 | nll_loss 3.371 | ppl 10.34 | wps 24011 | ups 1 | wpb 28913.466 | bsz 942.984 | num_updates 9352 | lr 0.000924896 | gnorm 0.368 | clip 0.000 | oom 0.000 | loss_scale 32.000 | wall 228 | train_wall 226
Traceback (most recent call last):
File "/private/home/sshleifer/.conda/envs/clinic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq_cli/train.py", line 307, in distributed_main
main(args, init_distributed=True)
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq_cli/train.py", line 90, in main
train(args, trainer, task, epoch_itr)
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq_cli/train.py", line 139, in train
log_output = trainer.train_step(samples)
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq/trainer.py", line 349, in train_step
raise e
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq/trainer.py", line 311, in train_step
loss, sample_size, logging_output = self.task.train_step(
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq/tasks/fairseq_task.py", line 264, in train_step
optimizer.backward(loss)
File "/private/home/sshleifer/Transformer-Clinic/fairseq/fairseq/optim/fp16_optimizer.py", line 103, in backward
loss.backward()
File "/private/home/sshleifer/.conda/envs/clinic/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/private/home/sshleifer/.conda/envs/clinic/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
Variable._execution_engine.run_backward(
RuntimeError: Function 'SoftmaxBackward' returned nan values in its 0th output.
Contents of profile_ratio.init: https://gist.github.com/sshleifer/b615558499b9b10bd5bee8ddf2db030a
Data directory:
I would add
python 3.6, torch 1.5 or torch 1.6
to the README. I think with those versions, plus some guidance that training takes a really long time, it will make sense. The logs were really helpful; I think I am getting similar results now.

This is weird: all the admin models on wmt14ende failed in your setting. I compared your logs and my logs, and their development ppls are almost the same. It seems the training just shut down for some unknown reason…
One random guess is half-precision training (since it should detect NaN gradients and adjust the loss scaling accordingly). Maybe you can load the last checkpoint and see whether full-precision training can avoid the error? A sketch of that check is below.
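A minimal sketch of that check, not something verified on this exact setup: it assumes fairseq wrote a checkpoint_last.pt into the --save-dir of the failed run and that the variables from the script above ($dd, $GPUS, $TOKEN_NUMBER, $UPDATE_FREQUENCE) are still set. It mirrors the original second command with the fp16-related flags (--fp16, --fp16-scale-window, --threshold-loss-scale) dropped:

# assumes wmt14ende/wmt-admin-18l/checkpoint_last.pt exists; add --reset-optimizer
# if the optimizer state saved by the fp16 run refuses to load in full precision
lnum=18
CUDA_VISIBLE_DEVICES=$GPUS fairseq-train \
$dd -s en -t de \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --max-update 500000 \
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1 \
--max-tokens $TOKEN_NUMBER --update-freq $UPDATE_FREQUENCE \
--save-dir wmt14ende/wmt-admin-${lnum}l --restore-file checkpoint_last.pt --seed 1111 \
--user-dir ../radam_fairseq --log-format simple --log-interval 500 \
--init-type adaptive \
--encoder-layers $lnum --decoder-layers $lnum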
Another random guess is the distributed training. Maybe you can load the last checkpoint and see whether a one-GPU setting can avoid the error (with UPDATE_FREQUENCE=4)?
Also, it could be caused by OOM. Maybe you can load the last checkpoint and see whether halving the batch size and setting UPDATE_FREQUENCE=2 can avoid the error? Sketches of both variants follow below.
(Sorry, I’m fully occupied with job applications this week and cannot run experiments with the torch 1.6 / CUDA 10.1 setup…)
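Minimal sketches of those two variants, under the same assumptions as the full-precision sketch above (a checkpoint_last.pt exists under the --save-dir, and the variables from the original script are still set). The shared flags go into a bash array so that only the device list, --max-tokens and --update-freq change between variants:

lnum=18
# shared flags: identical to the original second command, except that
# --restore-file points at the (assumed) checkpoint_last.pt instead of x.pt
common_flags=(
"$dd" -s en -t de
--arch transformer_wmt_en_de --share-all-embeddings
--optimizer radam --adam-betas '(0.9, 0.98)' --clip-norm 0.0
--lr-scheduler inverse_sqrt --max-update 500000
--warmup-init-lr 1e-07 --warmup-updates 8000 --lr 0.001 --min-lr 1e-09
--criterion label_smoothed_cross_entropy --label-smoothing 0.1
--weight-decay 0.0 --attention-dropout 0.1 --relu-dropout 0.1
--save-dir wmt14ende/wmt-admin-${lnum}l --restore-file checkpoint_last.pt --seed 1111
--user-dir ../radam_fairseq --log-format simple --log-interval 500
--init-type adaptive --fp16 --fp16-scale-window 256
--encoder-layers $lnum --decoder-layers $lnum
--threshold-loss-scale 0.03125
)

# variant 1: one GPU; --update-freq 4 keeps the effective batch of the 4-GPU run
CUDA_VISIBLE_DEVICES=$GPUID fairseq-train "${common_flags[@]}" \
--max-tokens $TOKEN_NUMBER --update-freq 4

# variant 2: all 4 GPUs, half the per-step batch, --update-freq 2 (to rule out OOM)
CUDA_VISIBLE_DEVICES=$GPUS fairseq-train "${common_flags[@]}" \
--max-tokens $((TOKEN_NUMBER / 2)) --update-freq 2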