
Exploding in WMT14 en-fr

Hello. I've preprocessed my data and set the training parameters to match the pre-trained models/wmt14.en-fr.fconv-py/README.md. However, I get:

| [en] dictionary: 43881 types
| [fr] dictionary: 43978 types
| data-bin train 35482842 examples
| data-bin valid 26663 examples
| data-bin test 3003 examples
| using 8 GPUs (with max tokens per GPU = 4000)
| model fconv_wmt_en_fr
Warning! 1 samples are either too short or too long and will be ignored, sample ids=[28743556]
| epoch 001  1000 / 331737 loss=9.57 (10.94), wps=18515, wpb=31259, bsz=861, lr=1.25, clip=100%, gnorm=2.0540
| epoch 001  2000 / 331737 loss=8.61 (9.91), wps=18466, wpb=31229, bsz=877, lr=1.25, clip=100%, gnorm=1.7149
| epoch 001  3000 / 331737 loss=7.50 (9.23), wps=18493, wpb=31226, bsz=871, lr=1.25, clip=100%, gnorm=2.7501
| epoch 001  4000 / 331737 loss=6.87 (8.75), wps=18522, wpb=31231, bsz=873, lr=1.25, clip=100%, gnorm=100615.8788
| epoch 001  5000 / 331737 loss=10405.01 (136.96), wps=18532, wpb=31216, bsz=874, lr=1.25, clip=100%, gnorm=1500459828271.3960
| epoch 001  6000 / 331737 loss=4773454961.36 (92926125.94), wps=18564, wpb=31213, bsz=867, lr=1.25, clip=100%, gnorm=37459419138681.4219
| epoch 001  7000 / 331737 loss=7746569234820.15 (126329286789.38), wps=18577, wpb=31211, bsz=864, lr=1.25, clip=100%, gnorm=inf
| epoch 001  8000 / 331737 loss=18016233617.10 (228909462625.55), wps=18562, wpb=31205, bsz=866, lr=1.25, clip=100%, gnorm=inf
| epoch 001  9000 / 331737 loss=6500325670920.53 (321325856038.58), wps=18597, wpb=31214, bsz=860, lr=1.25, clip=100%, gnorm=inf
| epoch 001 10000 / 331737 loss=11162501170786.86 (715142464195.40), wps=18609, wpb=31219, bsz=858, lr=1.25, clip=100%, gnorm=inf
....

--------------------------ENV----------------------------

8x P40 GPUs

--------------------------DATA PREPROCESSING----------------------------

  1. normalize-punctuation
  2. tokenizer
  3. clean-corpus-n
  4. shuffle
  5. learn and apply BPE

I've checked that the en-fr sentence correspondence is preserved after preprocessing. (A sketch of these steps follows below.)
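
For context, one plausible way these five steps map onto concrete commands is sketched here, assuming the Moses scripts and the subword-nmt package were used; the paths, thread count, and BPE merge count are my assumptions, not values taken from the issue.

MOSES=mosesdecoder/scripts
for l in en fr; do
  cat train.$l \
    | perl $MOSES/tokenizer/normalize-punctuation.perl -l $l \
    | perl $MOSES/tokenizer/tokenizer.perl -l $l -threads 8 \
    > train.tok.$l
done
# drop empty and overly long sentence pairs (min 1, max 250 tokens)
perl $MOSES/training/clean-corpus-n.perl train.tok en fr train.clean 1 250
# shuffle both sides with the same permutation
paste train.clean.en train.clean.fr | shuf > train.shuf
cut -f1 train.shuf > train.shuf.en
cut -f2 train.shuf > train.shuf.fr
# learn a joint BPE over both languages, then apply it to each side
cat train.shuf.en train.shuf.fr | subword-nmt learn-bpe -s 40000 > bpe.codes
for l in en fr; do
  subword-nmt apply-bpe -c bpe.codes < train.shuf.$l > train.bpe.$l
done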

--------------------------TRAINING PARAMETER----------------------------

fairseq_train_param="-s en -t fr --arch fconv_wmt_en_fr --dropout 0.1 --lr 1.25 --clip-norm 0.1 --max-tokens 4000 --force-anneal 32"
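
For reference, a minimal sketch of how such a parameter string is typically passed to train.py, assuming a preprocessed data-bin directory; the data path and the --save-dir value are my assumptions:

DATA=data-bin/wmt14_en_fr   # hypothetical path to the binarized data
python train.py $DATA \
  -s en -t fr --arch fconv_wmt_en_fr \
  --dropout 0.1 --lr 1.25 --clip-norm 0.1 \
  --max-tokens 4000 --force-anneal 32 \
  --save-dir checkpoints/fconv_wmt_en_fr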

Can you help me figure out the problem? Thank you.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 20 (20 by maintainers)

Top GitHub Comments

1 reaction
Zrachel commented, Jan 2, 2018

Hi @edunov, I found some differences between our reported results:

Your reported result above is in the format of:

checkpoint1.pt: | Generate test with beam=5: BLEU4 = 35.91, 64.4/42.0/29.4/20.9 (BP=1.000, ratio=1.000, syslen=81191, reflen=81194)

which might be directly calculated by generate.py.


Mine, meanwhile, is in the format of:

checkpoint1.pt/test.bleu:BLEU4 = 30.11, 59.5/35.8/23.8/16.2 (BP=1.000, ratio=0.975, syslen=83264, reflen=81204)

It was computed as follows:

# decode the test set, then extract hypotheses (H-*) and references (T-*)
# from the output, restore sentence order, strip the BPE markers, and score with score.py
python generate.py $data --path $path > $out/test.out
grep ^H $out/test.out | sed 's/^H-//g' | sort -k1,1n | cut -f 3 | sed 's/@@ //g' > $out/test.trans
grep ^T $out/test.out | sed 's/^T-//g' | sort -k1,1n | cut -f 2 | sed 's/@@ //g' > $out/test.ref
python ../../PyFairseq/score.py --sys $out/test.trans --ref $out/test.ref > $out/test.bleu

Can you paste your result of score.py for reference? Thank you.
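
For reference, the "Generate test with beam=5: BLEU4 = ..." line quoted above is the summary that generate.py prints itself at the end of decoding; a minimal sketch of such an invocation, where the data-bin path, checkpoint path, and --remove-bpe flag are assumptions rather than values taken from this thread:

python generate.py data-bin/wmt14_en_fr \
  --path checkpoints/checkpoint1.pt \
  --beam 5 --remove-bpe | tail -n 1
# the last line of the output is the "Generate test with beam=5: BLEU4 = ..." summary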

1 reaction
myleott commented, Nov 12, 2017

We recently discovered an issue with recent versions of PyTorch and our multi-GPU training code. The fix is here: https://github.com/facebookresearch/fairseq-py/commit/d7d82715f968097bba08c92416d332d969bd1f06. Can you update your fairseq-py or apply the fix and see if it solves the exploding loss issue?
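
A minimal sketch of picking up that commit in an existing checkout, assuming a git clone of fairseq-py with the upstream repository as origin:

cd fairseq-py
git fetch origin
# apply just the multi-GPU fix on top of the current checkout
git cherry-pick d7d82715f968097bba08c92416d332d969bd1f06
# or simply update the whole checkout to a revision that already includes it:
# git pull origin master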
