How to train a simple, vanilla transformer translation model from scratch with Fairseq
I have been familiarizing myself with the fairseq library recently and have tried a couple of pretrained models. I thought a good way to teach myself would be to train a plain vanilla transformer model on the data I have, and then modify it and maybe add bells and whistles like pre-training from there. The fairseq documentation has an example of this with the fconv architecture, and I would basically like to do the same with the transformer architecture.
Below is the code I tried:
For data preparation, I cleaned the data with the Moses scripts, tokenized the words, and then applied BPE using subword-nmt, with the number of BPE merge operations set to 15000.
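For reference, the BPE step was along the following lines. This is a sketch only: the file names are illustrative, and whether one learns a joint BPE model over both languages or separate ones is a setup choice (a joint model matches the --joined-dictionary flag below).
# Learn a joint BPE model on the concatenated source+target training text (15k merge operations)
cat data/train.tok.zh data/train.tok.en | subword-nmt learn-bpe -s 15000 -o data/bpe.codes
# Apply the learned codes to every split in both languages
for lang in zh en; do
  for split in train valid test; do
    subword-nmt apply-bpe -c data/bpe.codes < data/$split.tok.$lang > data/$split.$lang
  done
done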
For preprocessing:
fairseq-preprocess --source-lang zh --target-lang en \
--trainpref data/train --validpref data/valid --testpref data/test \
--joined-dictionary \
--destdir data-bin \
--workers 20
For training:
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train data-bin \
--clip-norm 0.1 --dropout 0.2 --max-tokens 2048 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt \
--criterion label_smoothed_cross_entropy \
--lazy-load \
--update-freq 4 \
--keep-interval-updates 100 --save-interval-updates 3000 --log-interval 50 \
--arch transformer --save-dir checkpoints/transformer
I trained this on a data set of ~19M samples, on 4 NVIDIA P100 GPUs, for about 8 hours; at that point I had completed a bit more than 1 epoch. I then tested against my checkpoints. For the first checkpoint, at update 3000, the prediction was all "the the the"s, but that might be OK since it was only the first checkpoint. However, when I tested against the last checkpoint, the prediction was the same sentence for every test sample: "committee on the peaceful uses of outer space", and the BLEU score was 0. My test set is not at all about outer space.
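For completeness, this is roughly how I generated and scored the predictions (the checkpoint path and batch size are just examples):
fairseq-generate data-bin \
--path checkpoints/transformer/checkpoint_last.pt \
--batch-size 64 --beam 5 --remove-bpe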
So after this extremely disappointing result, I realized that I should ask for some pointers on creating a basic transformer model:
- First of all, is my result actually within expectation? The paper on which transformer.py is based, Jointly Learning to Align and Translate, states that state-of-the-art results were achieved on 64 Volta GPUs for 30k updates (!!!). My setup is much smaller, so maybe the result is expected? However, I have previously achieved better results in less time with less data, so I doubt that. Is it just that the learning rate was not set right, so training got stuck in some weird local minimum? Or is there more wrong with my setup above?
- When would the above model stop training? max_epoch and max_update are not required parameters and are set to math.inf when not given. From train.py, it looks like training continues until the learning rate drops below args.min_lr, but I can't find where min_lr is set, and it is not a parameter in the documentation. So what is min_lr? Is it 0? (One workaround I can think of is passing an explicit stopping criterion; see the sketch after this list.)
- What is the best architecture to use for the “vanilla” transformer model that I’m looking for?
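As a concrete version of the workaround mentioned above, this is a sketch of stopping the run explicitly via --max-update / --max-epoch (both are existing fairseq-train options; the values here are arbitrary placeholders, and the remaining flags just mirror my command above):
fairseq-train data-bin \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr 5e-4 --lr-scheduler inverse_sqrt \
--criterion label_smoothed_cross_entropy --max-tokens 2048 \
--max-update 100000 --max-epoch 20 \
--arch transformer --save-dir checkpoints/transformer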
Thank you!
Top GitHub Comments
Hi, @TheodorePeng,
Yes, the BLEU value was close to 0 in both training and evaluation. The problem was that I was not using an LR scheduler. As I said, you can check out the Transformer paper and the other link: in the paper the LR schedule is effectively mandatory, but they only state the step formula and don't stress how important it is, while the article at the other link tries to work out why the scheduler is necessary. When you train a Transformer, an LR scheduler is needed; I'm not sure exactly why. So I solved the issue mainly thanks to
--lr-scheduler inverse_sqrt
, and the other flags are values you can tune to get better results.
I've been dealing with the same problem, but for another language pair (et-en). I solved it by adding a few extra flags to fairseq-train, mainly the LR-scheduler flag quoted above. After these flags were added, the training started to work perfectly. Check out https://arxiv.org/pdf/1706.03762.pdf#optimizer and https://www.borealisai.com/en/blog/tutorial-17-transformers-iii-training/ --> "Learning rate warm-up"
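To make the warm-up point concrete, here is a sketch of what an inverse_sqrt schedule with warm-up typically looks like in a fairseq-train call. This is an illustration, not necessarily the exact flags from the comment above; --warmup-updates and --warmup-init-lr are existing fairseq options, and the values shown (4000 warm-up updates, initial LR 1e-07, label smoothing 0.1) are the ones commonly used in the fairseq Transformer examples:
fairseq-train data-bin \
--arch transformer --optimizer adam --adam-betas '(0.9, 0.98)' \
--lr 5e-4 --lr-scheduler inverse_sqrt \
--warmup-updates 4000 --warmup-init-lr 1e-07 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 2048 --save-dir checkpoints/transformer
With inverse_sqrt, the learning rate ramps up linearly from --warmup-init-lr to --lr over the first --warmup-updates steps and then decays proportionally to the inverse square root of the update number, which is the schedule described in the optimizer section of the paper linked above.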