Transformer generates sentences unrelated to the input
Hello, I am trying to use the transformer on a sentence simplification dataset. Training seems to run without problems, but at generation time the hypothesis sentences do not make any sense. I was wondering if you could help me figure out what I am doing wrong.
I tried to follow this example that you provide for translation using the transformer.
1. Pre-processing: The dataset I am using for training contains aligned sentences such as this pair:
Original: In Holland they were called Stadspijpers , in Germany Stadtpfeifer and in Italy Pifferi .
Simplified: They were called Stadtpfeifer in Germany and Pifferi in Italy .
Since the sentences in the dataset are already tokenised, for pre-processing I only lowercased all sentences and learned/applied BPE using the following script:
src=orig
tgt=simp
prep=data/wikilarge/prep
tmp=$prep/tmp
orig=data/wikilarge

mkdir -p $prep $tmp

# Lowercase every split on both sides
# ($LC, $BPEROOT and $BPE_TOKENS are set as in the translation example)
for d in train dev test; do
    for l in $src $tgt; do
        perl $LC < $orig/wikilarge.$d.$l > $tmp/wikilarge.$d.low.$l
    done
done

# Learn BPE on the concatenated source+target training data
TRAIN=$tmp/train.wikilarge
BPE_CODE=$prep/code
rm -f $TRAIN
for l in $src $tgt; do
    cat $tmp/wikilarge.train.low.$l >> $TRAIN
done
python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE

# Apply BPE to every split on both sides
for L in $src $tgt; do
    for d in train dev test; do
        echo "apply_bpe.py to wikilarge.${d}.low.${L}..."
        python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/wikilarge.$d.low.$L > $prep/$d.$L
    done
done
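Before binarizing, a quick sanity check (my own addition, reusing the variables from the script above) is to confirm that each BPE-applied file still has the same number of lines as its lowercased source, since a mismatch would silently misalign the sentence pairs:
for d in train dev test; do
    for l in $src $tgt; do
        echo "$d.$l: $(wc -l < $tmp/wikilarge.$d.low.$l) lowercased vs $(wc -l < $prep/$d.$l) BPE lines"
    done
done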
Then I proceeded to binarize the dataset:
TEXT=data/wikilarge/prep
fairseq-preprocess --source-lang orig --target-lang simp \
--trainpref $TEXT/train --validpref $TEXT/dev --testpref $TEXT/test \
--destdir data/wikilarge/bin/
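One thing that might be worth trying here (my suggestion, not something the example requires): since both sides are English and the BPE codes were learned on the concatenated source and target, a shared vocabulary seems natural, which fairseq-preprocess supports via --joined-dictionary:
fairseq-preprocess --source-lang orig --target-lang simp \
    --trainpref $TEXT/train --validpref $TEXT/dev --testpref $TEXT/test \
    --joined-dictionary --destdir data/wikilarge/bin/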
2. Training: For training, I used the same command as in the example provided. I am aware that I’d need to adapt the parameters to suit the dataset, but I thought it was a good starting point.
mkdir -p models/wikilarge/transformer/checkpoints/
CUDA_VISIBLE_DEVICES=0 fairseq-train data/wikilarge/bin \
-a transformer --optimizer adam --lr 0.0005 -s orig -t simp \
--label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 \
--min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 50000 \
--warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9, 0.98)' --save-dir models/wikilarge/transformer/checkpoints/
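As a side note (my own habit, not part of the example), it helps to keep the training log around and check that the validation loss and perplexity keep dropping:
# Assuming the fairseq-train command above is re-run with its output captured,
# e.g. by appending  2>&1 | tee models/wikilarge/transformer/train.log
# the per-epoch validation summaries can then be inspected with:
grep valid models/wikilarge/transformer/train.log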
3. Generation: As in the example, I executed the following commands:
# Average 10 latest checkpoints:
python scripts/average_checkpoints.py --inputs models/wikilarge/transformer/checkpoints \
--num-epoch-checkpoints 10 --output models/wikilarge/transformer/checkpoints/model.pt
# Generate
fairseq-generate data/wikilarge/bin \
--path models/wikilarge/transformer/checkpoints/model.pt \
--batch-size 128 --beam 5 --remove-bpe
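Not part of the example, but saving the generate output to a file makes it easy to pull out the hypotheses and references and score them (the file names below are my own choice; BLEU is not a simplification metric, but it gives a quick signal while debugging):
fairseq-generate data/wikilarge/bin \
    --path models/wikilarge/transformer/checkpoints/model.pt \
    --batch-size 128 --beam 5 --remove-bpe > gen.out
# H-* lines carry the hypotheses, T-* lines the references, in the same interleaved order
grep ^H gen.out | cut -f3- > gen.out.sys
grep ^T gen.out | cut -f2- > gen.out.ref
fairseq-score --sys gen.out.sys --ref gen.out.ref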
Most output sentences I get are like this:
S-124 the two former presidents were later separately charged with mutiny and treason for their roles in the 1979 coup and the 1980 gwangju massacre .
T-124 the two former presidents were later charged , each on their own , with mutiny and treason for their roles in the 1979 coup and the 1980 gwangju massacre .
H-124 -1.1352218389511108 he was the first woman to win the tour de france .
P-124 -2.4326 -1.1815 -1.0359 -1.1694 -1.9666 -0.0793 -2.0569 -0.5309 -2.4636 -0.2983 -0.0907 -1.3463 -0.1060
S-258 a town may be correctly described as a market town or as having market rights even if it no longer holds a market , provided the right to do so still exists .
T-258 a town may be correctly identified by a market or as having market rights even if it no longer holds a market , provided the right to do so still exists .
H-258 -0.9187995195388794 this is a list of people who live in the city .
P-258 -3.2003 -0.9018 -1.4129 -0.1210 -0.0787 -1.7663 -0.2098 -1.9090 -0.2615 -0.8027 -0.7472 -0.4241 -0.1091
As can be seen, the generated H sentences make no sense: they are not related at all to the corresponding input.
Am I doing something wrong at training or generation time that causes this? Maybe I am not understanding the parameters properly?
I hope this is the right place to ask this type of question. Thank you.
Top GitHub Comments
I see. I’ll start changing the parameters and see what happens. Any suggestion on what to try first would be welcome; I’m not very experienced in this. This paper uses the transformer (tensor2tensor) for the same data, so I’ll try to use their configuration in fairseq as a starting point.
Skimming your provided code, it looks alright. Does the model training look stable? Is the perplexity decreasing? If you decode on the training set instead of on test/valid, can the model produce the target sentences?
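For reference, the training-set decode suggested here can be run with fairseq-generate's --gen-subset flag (a minimal sketch, reusing the averaged checkpoint from step 3); if the model cannot roughly reproduce targets it has already seen, the problem is on the training side rather than in generation:
fairseq-generate data/wikilarge/bin \
    --gen-subset train \
    --path models/wikilarge/transformer/checkpoints/model.pt \
    --batch-size 128 --beam 5 --remove-bpe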