Cannot replicate T5 performance on WMT14
System Info
I am trying to replicate T5 finetuning on WMT with the following hyperparameters (as close as possible to the paper https://www.jmlr.org/papers/volume21/20-074/20-074.pdf):
--model_name_or_path t5-small --source_lang en --target_lang de --dataset_name stas/wmt14-en-de-pre-processed --max_source_length 512 --max_target_length 512 --val_max_target_length 512 --source_prefix="translate English to German: " --predict_with_generate --save_steps 5000 --eval_steps 5000 --learning_rate 0.001 --max_steps 262144 --optim adafactor --lr_scheduler_type constant --gradient_accumulation_steps 2 --per_device_train_batch_size 64
However, the best BLEU I get is around 13, whereas the paper reports around 27. Any comments on how to fix this?
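As a quick sanity check of the hyperparameters themselves, the flags above can be compared against the paper's fine-tuning setup (batches of 128 length-512 sequences, i.e. 2**16 tokens, for 2**18 steps). This is a sketch assuming a single GPU, as the environment info below reports:

```python
# Sanity check: do the flags above reproduce the T5 paper's
# fine-tuning batch of 128 length-512 sequences (2**16 tokens)
# for 2**18 steps? Assumes a single GPU, per the environment info.
per_device_train_batch_size = 64
gradient_accumulation_steps = 2
num_gpus = 1
max_source_length = 512
max_steps = 262_144

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
tokens_per_batch = effective_batch * max_source_length

print(effective_batch)        # 128 sequences per optimizer step
print(tokens_per_batch)       # 65536 tokens, i.e. 2**16
print(max_steps == 2 ** 18)   # True
```

So the effective batch size and step count do match the paper, which suggests the gap comes from something else (e.g. the checkpoint or evaluation setup) rather than these flags.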
Script: https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation.py
Environment:
- transformers version: 4.20.1
- Platform: Linux-4.18.0-348.el8.x86_64-x86_64-with-glibc2.28
- Python version: 3.10.4
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes - A100
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Run the script with the hyperparameters above: https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation.py
Expected behavior
BLEU score should be around 27.
Issue Analytics
- State:
- Created a year ago
- Comments: 14 (2 by maintainers)
Top GitHub Comments
@ekurtulus I also think the checkpoints t5-small, t5-base, etc. have been trained on the WMT / CNN DailyMail datasets, as shown in the code snippet below. So using those checkpoints to replicate the results (by finetuning on those datasets) doesn't really make sense IMO.
Code snippet
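The snippet itself did not survive on this page. A hypothetical reconstruction of the kind of check meant here, assuming the point is that the released T5 configs ship task_specific_params for the supervised tasks (translation, summarization) included in T5's multi-task training mixture:

```python
# Hypothetical reconstruction -- the original snippet is missing from
# this page. The released T5 checkpoints carry task_specific_params in
# their config for the supervised tasks they saw during multi-task
# training. Requires network access to the Hugging Face Hub.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("t5-small")
print(sorted(config.task_specific_params))
# e.g. ['summarization', 'translation_en_to_de',
#       'translation_en_to_fr', 'translation_en_to_ro']
```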
Outputs
Sorry for being late. I will take a look.