Reproducibility issue on XSum dataset
Hi,
I followed the instructions for fine-tuning UniLM v1.2 on the XSum task. Specifically, I used the following commands:
python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
--train_file ${TRAIN_FILE} --output_dir ${OUTPUT_DIR} --model_type unilm \
--model_name_or_path unilm1.2-base-uncased --do_lower_case \
--fp16 --fp16_opt_level O2 --max_source_seq_length 464 --max_target_seq_length 48 \
--per_gpu_train_batch_size 16 --gradient_accumulation_steps 1 --learning_rate 7e-5 \
--num_warmup_steps 500 --num_training_steps 32000 --cache_dir ${CACHE_DIR}
The training loss at the end of fine-tuning goes to ~1.9.
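For context, the ${TRAIN_FILE} I pass in is a JSON-lines file; this is a minimal sketch of how I build it, assuming the "src"/"tgt" field names used by the s2s-ft data files (please correct me if the expected keys differ):

import json

# Hypothetical input: parallel lists of documents and one-sentence summaries.
documents = ["the full article text ...", "another article ..."]
summaries = ["one-sentence summary ...", "another summary ..."]

with open("xsum.train.json", "w", encoding="utf-8") as f:
    for doc, summ in zip(documents, summaries):
        # One JSON object per line: source document and target summary.
        f.write(json.dumps({"src": doc, "tgt": summ}) + "\n")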
Next I decode on the test set using:
python decode_seq2seq.py --fp16 --model_type unilm --tokenizer_name unilm1.2-base-uncased \
--input_file ${INPUT_JSON} --split $SPLIT --do_lower_case --model_path ${MODEL_PATH} \
--max_seq_length 512 --max_tgt_length 48 --batch_size 16 --beam_size 5 --length_penalty 0 \
--forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "."
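As far as I can tell, decoding writes one generated summary per line to ${MODEL_PATH}.${SPLIT}, so I spot-check that file before scoring (a sketch with the path hard-coded as an example value):

# Quick look at the decoded output: one summary per line, in input order (my assumption).
pred_file = "finetuned_xsum/ckpt-32000.test"  # example value of ${MODEL_PATH}.${SPLIT}

with open(pred_file, encoding="utf-8") as f:
    preds = [line.strip() for line in f]

print(f"{len(preds)} predictions")
for p in preds[:3]:
    print(len(p.split()), "tokens:", p)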
And evaluate using:
python evaluations/eval_for_xsum.py --pred ${MODEL_PATH}.${SPLIT} \
--gold ${GOLD_PATH} --split ${SPLIT} --perl
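As an independent cross-check (separate from the official evaluation script), I also score the outputs with the rouge_score package. This is only a rough sketch, assuming one summary per line in both the prediction and the reference files, and it will not exactly match the Perl ROUGE numbers:

from rouge_score import rouge_scorer

pred_file = "finetuned_xsum/ckpt-32000.test"  # example ${MODEL_PATH}.${SPLIT}
gold_file = "xsum.eval/test.target"           # example ${GOLD_PATH}

with open(pred_file, encoding="utf-8") as f:
    preds = [line.strip() for line in f]
with open(gold_file, encoding="utf-8") as f:
    golds = [line.strip() for line in f]

# If these counts differ, the gold file is probably the wrong one.
assert len(preds) == len(golds), (len(preds), len(golds))

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = [scorer.score(g, p) for g, p in zip(golds, preds)]
for key in ("rouge1", "rouge2", "rougeL"):
    avg = sum(s[key].fmeasure for s in scores) / len(scores)
    print(key, round(100 * avg, 2))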
However, the official evaluation script gives me the following ROUGE scores:
ROUGE-F(1/2/l): 9.75/4.47/7.22
ROUGE-R(1/2/l): 5.34/2.44/3.95
I also tried fine-tuning MiniLM following the provided instructions; the results are similar to the above.
@donglixp @wolfshow Could you please guide me on what I must be doing wrong?
Thanks in advance!
Top GitHub Comments
@donglixp Oops, that was indeed the issue. I was wrongly using xsum.test.json as ${GOLD_PATH}.
It works perfectly now and the ROUGE scores look as expected.
Thanks for the help. You may close this issue 👍
I checked your decoding results, and they look well aligned with the reference. In the evaluation command, does the env var ${GOLD_PATH} point to the test.target file from https://unilm.blob.core.windows.net/s2s-ft-data/xsum.eval.zip?
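For anyone else hitting the same mismatch, the references come from that archive; a minimal way to fetch and unpack it (a sketch using only the standard library — the extracted file layout is my assumption, so check the printed names):

import io
import urllib.request
import zipfile

url = "https://unilm.blob.core.windows.net/s2s-ft-data/xsum.eval.zip"

# Download the evaluation archive and extract it into ./xsum.eval.
with urllib.request.urlopen(url) as resp:
    with zipfile.ZipFile(io.BytesIO(resp.read())) as zf:
        zf.extractall("xsum.eval")
        # Confirm where test.target ends up; ${GOLD_PATH} should point at it,
        # not at xsum.test.json.
        print(zf.namelist())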