Reproducing result on WMT14' en-fr

See original GitHub issue

Following the latest code with the training parameters specified by @edunov in https://github.com/facebookresearch/fairseq-py/issues/41 and the README.md for the pretrained models, I got exploding updates on WMT14 en-fr:

+ miniconda3/bin/python3 PyFairseq/train.py data-bin --save-dir model -s en -t fr --arch fconv_wmt_en_fr --dropout 0.1 --lr 2.5 --clip-norm 0.1 --max-tokens 4000 --force-anneal 32 
Namespace(adam_betas='(0.9, 0.999)', arch='fconv_wmt_en_fr', clip_norm=0.1, curriculum=0, data='data-bin', decoder_attention='True', decoder_embed_dim=768, decoder_layers='[(512, 3)] * 6 + [(768, 3)] * 4 + [(1024, 3)] * 3 + [(2048, 1)] * 1 + [(4096, 1)] * 1', decoder_out_embed_dim=512, dropout=0.1, encoder_embed_dim=768, encoder_layers='[(512, 3)] * 6 + [(768, 3)] * 4 + [(1024, 3)] * 3 + [(2048, 1)] * 1 + [(4096, 1)] * 1', force_anneal=32, label_smoothing=0, log_format=None, log_interval=1000, lr='2.5', lrshrink=0.1, max_epoch=0, max_sentences=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4000, min_lr=1e-05, model='fconv', momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, num_gpus=8, optimizer='nag', restore_file='checkpoint_last.pt', sample_without_replacement=0, save_dir='model', save_interval=-1, seed=1, sentence_avg=False, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='fr', train_subset='train', valid_subset='valid', weight_decay=0.0, workers=1) 
| [en] dictionary: 43881 types 
| [fr] dictionary: 43978 types 
| data-bin train 35482842 examples 
| data-bin valid 26663 examples 
| using 8 GPUs (with max tokens per GPU = 4000 and max sentences per GPU = None) 
| model fconv_wmt_en_fr, criterion CrossEntropyCriterion 
Warning! 1 samples are either too short or too long and will be ignored, first few sample ids=[28743556] 
| epoch 001: 1000 / 331737 loss=9.21 (10.89), wps=16319, wpb=31291, bsz=850, lr=2.5, clip=100%, gnorm=2.50713, oom=0 
| epoch 001: 2000 / 331737 loss=588.92 (19.76), wps=16417, wpb=31241, bsz=838, lr=2.5, clip=100%, gnorm=5.39344e+09, oom=0 
| epoch 001: 3000 / 331737 loss=126867869305.41 (3395251823.97), wps=16436, wpb=31258, bsz=849, lr=2.5, clip=100%, gnorm=2.05028e+16, oom=0 
| epoch 001: 4000 / 331737 loss=137727644131954352.00 (3821157344375131.00), wps=16438, wpb=31229, bsz=853, lr=2.5, clip=100%, gnorm=inf, oom=0 
| epoch 001: 5000 / 331737 loss=358248860949876800.00 (64219013624718560.00), wps=16454, wpb=31251, bsz=861, lr=2.5, clip=100%, gnorm=inf, oom=0 
| epoch 001: 6000 / 331737 loss=74803270219822464.00 (85287362140370208.00), wps=16464, wpb=31255, bsz=857, lr=2.5, clip=100%, gnorm=inf, oom=0 
| epoch 001: 7000 / 331737 loss=1124810776667683.12 (75791177781467504.00), wps=16478, wpb=31266, bsz=854, lr=2.5, clip=100%, gnorm=inf, oom=0 
| epoch 001: 8000 / 331737 loss=nan (nan), wps=16486, wpb=31252, bsz=852, lr=2.5, clip=94%, gnorm=nan, oom=0 
| epoch 001: 9000 / 331737 loss=nan (nan), wps=16493, wpb=31241, bsz=852, lr=2.5, clip=83%, gnorm=nan, oom=0 
| epoch 001: 10000 / 331737 loss=nan (nan), wps=16502, wpb=31244, bsz=855, lr=2.5, clip=75%, gnorm=nan, oom=0 
| epoch 001: 11000 / 331737 loss=nan (nan), wps=16511, wpb=31239, bsz=855, lr=2.5, clip=68%, gnorm=nan, oom=0 
| epoch 001: 12000 / 331737 loss=nan (nan), wps=16521, wpb=31240, bsz=855, lr=2.5, clip=62%, gnorm=nan, oom=0 
| epoch 001: 13000 / 331737 loss=nan (nan), wps=16529, wpb=31244, bsz=853, lr=2.5, clip=58%, gnorm=nan, oom=0 
| epoch 001: 14000 / 331737 loss=nan (nan), wps=16536, wpb=31239, bsz=851, lr=2.5, clip=53%, gnorm=nan, oom=0 
| epoch 001: 15000 / 331737 loss=nan (nan), wps=16539, wpb=31236, bsz=852, lr=2.5, clip=50%, gnorm=nan, oom=0 
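
For context on the clip and gnorm columns: gnorm is the total gradient norm before clipping and clip is the fraction of updates where --clip-norm kicked in. A minimal PyTorch sketch of this kind of norm clipping (assuming fairseq's trainer does the standard rescale-by-norm; the model and optimizer here are placeholders, not fairseq internals):

import torch

def clip_and_step(model, optimizer, clip_norm=0.1):
    # Total L2 norm of all gradients (the "gnorm" reported in the log).
    gnorm = torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    # clip_grad_norm_ rescales gradients in place so the total norm is <= clip_norm;
    # the "clip" percentage counts how often gnorm exceeded clip_norm.
    clipped = gnorm > clip_norm
    optimizer.step()
    return float(gnorm), bool(clipped)

In the failing run above, gnorm still blows up even with clip at 100%, which usually points to a learning rate that is too high for the setup rather than a problem with clipping itself.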

Only changing the learning rate to 1.25 avoids the exploding problem, but BLEU increases very slowly:

checkpoint1.pt/test.bleu:BLEU4 = 30.11, 59.5/35.8/23.8/16.2 (BP=1.000, ratio=0.975, syslen=83264, reflen=81204)
checkpoint2.pt/test.bleu:BLEU4 = 31.34, 60.4/37.1/25.0/17.2 (BP=1.000, ratio=0.986, syslen=82348, reflen=81204)
checkpoint3.pt/test.bleu:BLEU4 = 32.56, 61.4/38.4/26.1/18.2 (BP=1.000, ratio=0.988, syslen=82230, reflen=81204)
checkpoint4.pt/test.bleu:BLEU4 = 32.71, 61.5/38.5/26.3/18.4 (BP=1.000, ratio=0.989, syslen=82140, reflen=81204)
checkpoint5.pt/test.bleu:BLEU4 = 33.13, 62.0/38.9/26.7/18.7 (BP=1.000, ratio=0.997, syslen=81437, reflen=81204)
checkpoint6.pt/test.bleu:BLEU4 = 33.04, 61.5/38.8/26.7/18.7 (BP=1.000, ratio=0.995, syslen=81632, reflen=81204)
checkpoint7.pt/test.bleu:BLEU4 = 33.01, 61.6/38.8/26.6/18.7 (BP=1.000, ratio=0.987, syslen=82282, reflen=81204)
checkpoint8.pt/test.bleu:BLEU4 = 33.60, 62.2/39.4/27.2/19.1 (BP=1.000, ratio=0.992, syslen=81830, reflen=81204)
checkpoint9.pt/test.bleu:BLEU4 = 33.07, 61.6/38.9/26.7/18.7 (BP=1.000, ratio=0.993, syslen=81783, reflen=81204)
checkpoint10.pt/test.bleu:BLEU4 = 33.39, 62.2/39.3/27.0/19.0 (BP=0.999, ratio=1.001, syslen=81099, reflen=81204)
checkpoint11.pt/test.bleu:BLEU4 = 33.74, 62.5/39.6/27.3/19.2 (BP=1.000, ratio=0.993, syslen=81744, reflen=81204)
checkpoint12.pt/test.bleu:BLEU4 = 33.37, 61.8/39.1/27.0/19.0 (BP=1.000, ratio=0.992, syslen=81892, reflen=81204)
checkpoint13.pt/test.bleu:BLEU4 = 34.07, 62.6/39.9/27.6/19.5 (BP=1.000, ratio=0.996, syslen=81534, reflen=81204)
checkpoint14.pt/test.bleu:BLEU4 = 33.81, 62.4/39.6/27.4/19.3 (BP=1.000, ratio=0.994, syslen=81685, reflen=81204)
checkpoint15.pt/test.bleu:BLEU4 = 33.78, 62.6/39.7/27.3/19.2 (BP=0.999, ratio=1.001, syslen=81110, reflen=81204)
checkpoint16.pt/test.bleu:BLEU4 = 34.09, 62.8/39.9/27.6/19.5 (BP=1.000, ratio=0.994, syslen=81723, reflen=81204)
checkpoint17.pt/test.bleu:BLEU4 = 33.94, 62.3/39.7/27.5/19.5 (BP=1.000, ratio=0.990, syslen=81988, reflen=81204)
checkpoint18.pt/test.bleu:BLEU4 = 34.43, 62.8/40.2/28.0/19.9 (BP=1.000, ratio=0.993, syslen=81811, reflen=81204)
checkpoint19.pt/test.bleu:BLEU4 = 34.14, 62.6/40.0/27.7/19.6 (BP=1.000, ratio=0.994, syslen=81661, reflen=81204)
checkpoint20.pt/test.bleu:BLEU4 = 34.05, 62.5/39.9/27.6/19.6 (BP=1.000, ratio=0.999, syslen=81314, reflen=81204)
checkpoint21.pt/test.bleu:BLEU4 = 34.20, 62.8/40.0/27.8/19.6 (BP=1.000, ratio=0.999, syslen=81259, reflen=81204)
checkpoint22.pt/test.bleu:BLEU4 = 34.13, 62.4/40.0/27.7/19.6 (BP=1.000, ratio=0.998, syslen=81331, reflen=81204)
checkpoint23.pt/test.bleu:BLEU4 = 34.31, 62.6/40.1/27.9/19.8 (BP=1.000, ratio=0.991, syslen=81972, reflen=81204)
checkpoint26.pt/test.bleu:BLEU4 = 34.11, 62.9/40.1/27.7/19.4 (BP=1.000, ratio=0.999, syslen=81260, reflen=81204)
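
For reference, the BLEU4 figure in each line above is just the brevity penalty times the geometric mean of the four n-gram precisions printed after it. A quick check with the checkpoint1.pt values from the list:

import math

# n-gram precisions and brevity penalty reported for checkpoint1.pt
precisions = [59.5, 35.8, 23.8, 16.2]
bp = 1.000
bleu4 = bp * math.exp(sum(math.log(p / 100.0) for p in precisions) / 4) * 100
print(round(bleu4, 2))  # ~30.10, matching the reported 30.11 up to rounding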

My question is: are the results I got within expectation? Should I wait for the lr=1.25 run to finish, or is there something wrong with my data/config?

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 13 (11 by maintainers)

Top GitHub Comments

2 reactions
Zrachel commented, Jan 3, 2018

My fault. I removed the lowercasing step from the training data preprocessing, but forgot to remove it for the test data. Thank you very much.
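
For anyone hitting the same mismatch: a quick way to catch it is to compare casing statistics between the train and test source files before binarizing. The file names below are hypothetical placeholders:

# Rough sanity check: if train kept its original case but test was lowercased
# (or vice versa), the uppercase ratios will differ sharply.
def upper_ratio(path, limit=100000):
    upper = total = 0
    with open(path, encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            letters = [c for c in line if c.isalpha()]
            upper += sum(c.isupper() for c in letters)
            total += len(letters)
    return upper / max(total, 1)

print('train.en', upper_ratio('train.en'))
print('test.en', upper_ratio('test.en'))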

Results on the corrected test set:

checkpoint1.pt/test.bleu:BLEU4 = 35.73, 64.2/41.8/29.2/20.8 (BP=1.000, ratio=0.991, syslen=81952, reflen=81194)
checkpoint2.pt/test.bleu:BLEU4 = 37.20, 65.2/43.3/30.7/22.2 (BP=0.999, ratio=1.001, syslen=81142, reflen=81194)

Training and validation loss:

...
| epoch 001 | train loss 2.24 | train ppl 4.73 | s/checkpoint 77473 | words/s 16713 | words/batch 31228 | bsz 856 | lr 1.25 | clip 18% | gnorm 0.0936811 
| epoch 001 | valid on 'valid' subset | valid loss 1.75 | valid ppl 3.37 
...
| epoch 002 | train loss 1.76 | train ppl 3.38 | s/checkpoint 78280 | words/s 16541 | words/batch 31228 | bsz 856 | lr 1.25 | clip 0% | gnorm 0.0558167 
| epoch 002 | valid on 'valid' subset | valid loss 1.63 | valid ppl 3.11 
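
As a sanity check on these numbers, the reported perplexity appears to be 2 raised to the loss (i.e., loss measured in bits per token), which matches the log above:

# train loss 2.24 -> ppl ~4.72 (reported 4.73); valid loss 1.75 -> ppl ~3.36 (reported 3.37)
print(2 ** 2.24, 2 ** 1.75)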