Reproducibility issue on XSum dataset
Hi,
I followed the instructions for fine-tuning UniLM v1.2 on the XSum task. Specifically, I used the following commands:
python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
--train_file ${TRAIN_FILE} --output_dir ${OUTPUT_DIR} --model_type unilm \
--model_name_or_path unilm1.2-base-uncased --do_lower_case \
--fp16 --fp16_opt_level O2 --max_source_seq_length 464 --max_target_seq_length 48 \
--per_gpu_train_batch_size 16 --gradient_accumulation_steps 1 --learning_rate 7e-5 \
--num_warmup_steps 500 --num_training_steps 32000 --cache_dir ${CACHE_DIR}
The training loss at the end of fine-tuning goes to ~1.9.
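For context, the ${TRAIN_FILE} I pass in is a JSON-lines file; this is a minimal sketch of how I build it, assuming the "src"/"tgt" field names used by the s2s-ft data files (please correct me if the expected keys differ):

import json

# Hypothetical input: parallel lists of documents and one-sentence summaries.
documents = ["the full article text ...", "another article ..."]
summaries = ["one-sentence summary ...", "another summary ..."]

with open("xsum.train.json", "w", encoding="utf-8") as f:
    for doc, summ in zip(documents, summaries):
        # One JSON object per line: source document and target summary.
        f.write(json.dumps({"src": doc, "tgt": summ}) + "\n")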
Next I decode on the test set using:
python decode_seq2seq.py --fp16 --model_type unilm --tokenizer_name unilm1.2-base-uncased \
--input_file ${INPUT_JSON} --split $SPLIT --do_lower_case --model_path ${MODEL_PATH} \
--max_seq_length 512 --max_tgt_length 48 --batch_size 16 --beam_size 5 --length_penalty 0 \
--forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "."
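As far as I can tell, decoding writes one generated summary per line to ${MODEL_PATH}.${SPLIT}, so I spot-check that file before scoring (a sketch with the path hard-coded as an example value):

# Quick look at the decoded output: one summary per line, in input order (my assumption).
pred_file = "finetuned_xsum/ckpt-32000.test"  # example value of ${MODEL_PATH}.${SPLIT}

with open(pred_file, encoding="utf-8") as f:
    preds = [line.strip() for line in f]

print(f"{len(preds)} predictions")
for p in preds[:3]:
    print(len(p.split()), "tokens:", p)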
And evaluate using:
python evaluations/eval_for_xsum.py --pred ${MODEL_PATH}.${SPLIT} \
--gold ${GOLD_PATH} --split ${SPLIT} --perl
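As an independent cross-check (separate from the official evaluation script), I also score the outputs with the rouge_score package. This is only a rough sketch, assuming one summary per line in both the prediction and the reference files, and it will not exactly match the Perl ROUGE numbers:

from rouge_score import rouge_scorer

pred_file = "finetuned_xsum/ckpt-32000.test"  # example ${MODEL_PATH}.${SPLIT}
gold_file = "xsum.eval/test.target"           # example ${GOLD_PATH}

with open(pred_file, encoding="utf-8") as f:
    preds = [line.strip() for line in f]
with open(gold_file, encoding="utf-8") as f:
    golds = [line.strip() for line in f]

# If these counts differ, the gold file is probably the wrong one.
assert len(preds) == len(golds), (len(preds), len(golds))

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = [scorer.score(g, p) for g, p in zip(golds, preds)]
for key in ("rouge1", "rouge2", "rougeL"):
    avg = sum(s[key].fmeasure for s in scores) / len(scores)
    print(key, round(100 * avg, 2))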
However, the official evaluation script gives me the following ROUGE scores:
ROUGE-F(1/2/l): 9.75/4.47/7.22
ROUGE-R(1/2/l): 5.34/2.44/3.95
I also tried fine-tuning MiniLM following the provided instructions; the results are similar to the above.
@donglixp @wolfshow Could you please guide me on what I must be doing wrong?
Thanks in advance!
Top GitHub Comments
@donglixp Oops, that was indeed the issue. I was wrongly using xsum.test.json as ${GOLD_PATH}.
It works perfectly now and the ROUGE scores look as expected.
Thanks for the help. You may close this issue 👍
I checked your decoding results, and they look well aligned with the reference. In the evaluation command, does the env var ${GOLD_PATH} point to the test.target file from https://unilm.blob.core.windows.net/s2s-ft-data/xsum.eval.zip?
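For anyone else hitting the same mismatch, the references come from that archive; a minimal way to fetch and unpack it (a sketch using only the standard library — the extracted file layout is my assumption, so check the printed names):

import io
import urllib.request
import zipfile

url = "https://unilm.blob.core.windows.net/s2s-ft-data/xsum.eval.zip"

# Download the evaluation archive and extract it into ./xsum.eval.
with urllib.request.urlopen(url) as resp:
    with zipfile.ZipFile(io.BytesIO(resp.read())) as zf:
        zf.extractall("xsum.eval")
        # Confirm where test.target ends up; ${GOLD_PATH} should point at it,
        # not at xsum.test.json.
        print(zf.namelist())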