How to train a simple, vanilla transformer translation model from scratch with Fairseq
I have been familiarizing myself with the fairseq library recently and have tried a couple of pretrained models. I thought a good way to teach myself would be to train a plain vanilla transformer model on the data I have, and then modify it and maybe add bells and whistles like pre-training from there. The fairseq documentation has an example of this with the fconv architecture, and I would basically like to do the same with the transformer architecture.
Below is the code I tried:
For data preparation, I cleaned the data with the Moses scripts, tokenized the words, and then applied BPE using subword-nmt, with the number of BPE merge operations set to 15000.
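For reference, the BPE step was along the following lines. This is a sketch only: the file names are illustrative, and whether one learns a joint BPE model over both languages or separate ones is a setup choice (a joint model matches the --joined-dictionary flag below).
# Learn a joint BPE model on the concatenated source+target training text (15k merge operations)
cat data/train.tok.zh data/train.tok.en | subword-nmt learn-bpe -s 15000 -o data/bpe.codes
# Apply the learned codes to every split in both languages
for lang in zh en; do
  for split in train valid test; do
    subword-nmt apply-bpe -c data/bpe.codes < data/$split.tok.$lang > data/$split.$lang
  done
done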
For preprocessing:
fairseq-preprocess --source-lang zh --target-lang en \
--trainpref data/train --validpref data/valid --testpref data/test \
--joined-dictionary \
--destdir data-bin \
--workers 20
For training:
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train data-bin \
--clip-norm 0.1 --dropout 0.2 --max-tokens 2048 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt \
--criterion label_smoothed_cross_entropy \
--lazy-load \
--update-freq 4 \
--keep-interval-updates 100 --save-interval-updates 3000 --log-interval 50 \
--arch transformer --save-dir checkpoints/transformer
I trained this on a data set of ~19M samples, on 4 NVIDIA P100 GPUs, for about 8 hours; at that point I had completed a bit more than 1 epoch. I then tested against my checkpoints. For the first checkpoint, at update 3000, the prediction was all "the the the"s, but that might be OK since it was only the first checkpoint. However, when I tested against the last checkpoint, the prediction was the same sentence for every test sample: "committee on the peaceful uses of outer space", and the BLEU score was 0. My test set is not at all about outer space.
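For completeness, this is roughly how I generated and scored the predictions (the checkpoint path and batch size are just examples):
fairseq-generate data-bin \
--path checkpoints/transformer/checkpoint_last.pt \
--batch-size 64 --beam 5 --remove-bpe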
So after this extremely disappointing result, I realized that I should ask for some pointers on creating a basic transformer model:
- First of all, is my result actually within expectation? The paper on which transformer.py is based, Jointly Learning to Align and Translate, states that state-of-the-art results were achieved on 64 Volta GPUs for 30k updates (!!!). My setup is much smaller, so maybe the result is expected? However, I have previously achieved better results in less time with less data, so I doubt that. Is it just that the learning rate was not set right, so training got stuck in some weird local minimum? Or is there more wrong with my setup above?
- When would the above model stop training? max_epoch and max_update are not required parameters and are set to math.inf when not given. From train.py, it looks like training continues until the learning rate drops below args.min_lr, but I can't find where min_lr is set, and it is not a parameter in the documentation. So what is min_lr? Is it 0? (One workaround I can think of is passing an explicit stopping criterion; see the sketch after this list.)
- What is the best architecture to use for the “vanilla” transformer model that I’m looking for?
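As a concrete version of the workaround mentioned above, this is a sketch of stopping the run explicitly via --max-update / --max-epoch (both are existing fairseq-train options; the values here are arbitrary placeholders, and the remaining flags just mirror my command above):
fairseq-train data-bin \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr 5e-4 --lr-scheduler inverse_sqrt \
--criterion label_smoothed_cross_entropy --max-tokens 2048 \
--max-update 100000 --max-epoch 20 \
--arch transformer --save-dir checkpoints/transformer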
Thank you!
Top GitHub Comments
Hi, @TheodorePeng,
Yes, the BLEU value was close to 0 in both training and evaluation. The problem was that I was not using an LR scheduler. As I said, you can check out the Transformer paper and the other link: in the paper the LR schedule is effectively mandatory, but they only state the step formula and don't stress how important it is, while the article at the other link tries to work out why the scheduler is necessary. When you train a Transformer, an LR scheduler is needed; I'm not sure exactly why. So I solved the issue mainly thanks to
--lr-scheduler inverse_sqrt
, and the other flags are values you can tune to get better results.
I've been dealing with the same problem, but for another language pair (et-en). I solved it by adding a few extra flags to fairseq-train, mainly the LR-scheduler flag quoted above. After these flags were added, the training started to work perfectly. Check out https://arxiv.org/pdf/1706.03762.pdf#optimizer and https://www.borealisai.com/en/blog/tutorial-17-transformers-iii-training/ --> "Learning rate warm-up"
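To make the warm-up point concrete, here is a sketch of what an inverse_sqrt schedule with warm-up typically looks like in a fairseq-train call. This is an illustration, not necessarily the exact flags from the comment above; --warmup-updates and --warmup-init-lr are existing fairseq options, and the values shown (4000 warm-up updates, initial LR 1e-07, label smoothing 0.1) are the ones commonly used in the fairseq Transformer examples:
fairseq-train data-bin \
--arch transformer --optimizer adam --adam-betas '(0.9, 0.98)' \
--lr 5e-4 --lr-scheduler inverse_sqrt \
--warmup-updates 4000 --warmup-init-lr 1e-07 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 2048 --save-dir checkpoints/transformer
With inverse_sqrt, the learning rate ramps up linearly from --warmup-init-lr to --lr over the first --warmup-updates steps and then decays proportionally to the inverse square root of the update number, which is the schedule described in the optimizer section of the paper linked above.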