
bart-large-cnn ROUGE-L scores

See original GitHub issue

Environment info

Who can help

BART + Summarization @sshleifer

Information

The model I am using is BART.

The problem arises when verifying the accuracy numbers of facebook/bart-large-cnn on CNN + Daily Mail. The paper reports R1, R2, RL of 44.16, 21.28, 40.90, but I can only get 44.05, 21.07, 30.62. I used this to make my dataset. Is this expected?

The task I am working on is:

  • CNN-Dm summarization task

To reproduce

Steps to reproduce the behavior:

  1. Follow instructions to download dataset
  2. Run with python run_summarization.py --reference_path=data/cnn_dm/test.target data/cnn_dm/test.source results/test.log

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:7 (5 by maintainers)

Top GitHub Comments

1 reaction
yxyzzz commented, Sep 16, 2020
  1. rouge_score splits sentences on '\n'. Add a '\n' between the sentences in each summary before evaluating; the summary-level ROUGE-L (rougeLsum) will be much higher and close to the number in the literature: {'rouge1': 44.0536, 'rouge2': 21.0711, 'rougeL': 30.6157, 'rougeLsum': 40.9812}
from nltk.tokenize import sent_tokenize

output_ln2 = []
for o in output_ln:
    # split each generated summary into sentences, then rejoin with '\n'
    output_ln2.append('\n'.join(sent_tokenize(o)))
  2. Use pyrouge -> https://pypi.org/project/pyrouge/
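To see why the newline split changes the score so much, here is a minimal sketch (a simplified illustration using a best-match LCS per reference sentence, not the exact union-LCS that rouge_score's rougeLsum implements): without sentence boundaries, ROUGE-L takes one longest common subsequence over the whole summary, so sentences that appear in a different order than in the reference are largely uncredited; with boundaries, each sentence is matched separately.

```python
def lcs_len(a, b):
    # classic dynamic-programming longest-common-subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

ref = ["the cat sat", "the dog ran"]
pred = ["the dog ran", "the cat sat"]  # same sentences, different order

# Flat rougeL: one LCS over the whole summary as a single token sequence.
flat_ref = " ".join(ref).split()
flat_pred = " ".join(pred).split()
flat_recall = lcs_len(flat_ref, flat_pred) / len(flat_ref)  # -> 0.5

# Sentence-split (rougeLsum-style, simplified): match per reference sentence.
hits = sum(max(lcs_len(r.split(), p.split()) for p in pred) for r in ref)
sum_recall = hits / len(flat_ref)  # -> 1.0
```

With the flat sequence, the LCS can only cover one of the two sentences (recall 0.5); splitting on '\n' lets both sentences match (recall 1.0), which is the same effect that lifts rougeL 30.62 toward the rougeLsum 40.98 reported above.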
0 reactions
swethmandava commented, Sep 24, 2020

Opened a PR at #7356 that fixes this issue @sshleifer
