
bart-large-cnn ROUGE-L scores

See original GitHub issue

Environment info

Who can help

BART + Summarization @sshleifer

Information

The model I am using is BART.

The problem arises when verifying the accuracy numbers of facebook/bart-large-cnn on CNN + Daily Mail. The paper reports R1, R2, RL of 44.16, 21.28, 40.90, but I can only get 44.05, 21.07, 30.62. I used this to make my dataset. Is this expected?

The task I am working on is:

  • CNN-Dm summarization task

To reproduce

Steps to reproduce the behavior:

  1. Follow instructions to download dataset
  2. Run with python run_summarization.py --reference_path=data/cnn_dm/test.target data/cnn_dm/test.source results/test.log

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:7 (5 by maintainers)

Top GitHub Comments

1 reaction
yxyzzz commented, Sep 16, 2020
  1. rouge_score splits sentences on '\n'. Add a '\n' between the sentences in each summary before evaluating; the summary-level ROUGE-L (rougeLsum) will be much higher and close to the number in the literature: {'rouge1': 44.0536, 'rouge2': 21.0711, 'rougeL': 30.6157, 'rougeLsum': 40.9812}
from nltk.tokenize import sent_tokenize

output_ln2 = []
for o in output_ln:
    # split each generated summary into sentences, then rejoin with '\n'
    output_ln2.append('\n'.join(sent_tokenize(o)))
  2. Use pyrouge -> https://pypi.org/project/pyrouge/
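To see why the newline split changes the score so much, here is a minimal sketch (a simplified illustration using a best-match LCS per reference sentence, not the exact union-LCS that rouge_score's rougeLsum implements): without sentence boundaries, ROUGE-L takes one longest common subsequence over the whole summary, so sentences that appear in a different order than in the reference are largely uncredited; with boundaries, each sentence is matched separately.

```python
def lcs_len(a, b):
    # classic dynamic-programming longest-common-subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

ref = ["the cat sat", "the dog ran"]
pred = ["the dog ran", "the cat sat"]  # same sentences, different order

# Flat rougeL: one LCS over the whole summary as a single token sequence.
flat_ref = " ".join(ref).split()
flat_pred = " ".join(pred).split()
flat_recall = lcs_len(flat_ref, flat_pred) / len(flat_ref)  # -> 0.5

# Sentence-split (rougeLsum-style, simplified): match per reference sentence.
hits = sum(max(lcs_len(r.split(), p.split()) for p in pred) for r in ref)
sum_recall = hits / len(flat_ref)  # -> 1.0
```

With the flat sequence, the LCS can only cover one of the two sentences (recall 0.5); splitting on '\n' lets both sentences match (recall 1.0), which is the same effect that lifts rougeL 30.62 toward the rougeLsum 40.98 reported above.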
0 reactions
swethmandava commented, Sep 24, 2020

Opened a PR at #7356 that fixes this issue @sshleifer
