Evaluating BART on CNN/DM: How to process dataset

See original GitHub issue

From the README of BART for reproducing CNN/DM results:

Follow instructions here to download and process into data-files such that test.source and test.target has one line for each non-tokenized sample.

After following the instructions, I don’t have files like test.source and test.target.

Instead, I have test.bin and chunked versions of this file
(chunked/test_000.bin through chunked/test_011.bin).


How can I process test.bin into test.source and test.target?

@ngoyal2707 @yinhanliu

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

9 reactions
artmatsak commented, Jan 27, 2020

Here’s a version for Python 3 if anyone is interested:

https://github.com/artmatsak/cnn-dailymail

9 reactions
zhaoguangxiang commented, Dec 6, 2019

There are many details to get right; here is my code.

I fixed the excess length of train.bpe.source, which was caused by ASCII 0x0D (carriage return) characters in the articles, by splitting and re-joining the text.

Here is a summary of my notes:

  1. Remove the " " before “.”
  2. Keep the text cased: remove the line that lowercases it.
  3. “\r” in the original articles causes errors in BPE preprocessing.
  4. Remove “(CNN)”.
  5. Apply BPE encoding.

Code: https://gist.github.com/zhaoguangxiang/45bf39c528cf7fb7853bffba7fe57c7e
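
For readers who want a feel for the text-level notes (1 to 4) before opening the gist, here is a minimal sketch. It is not the code from the gist: the function name `clean_sample` is hypothetical, and the exact rules (for example, treating every run of whitespace as a single space) are assumptions made for illustration. BPE encoding (note 5) is a separate step and is not shown.

```python
import re


def clean_sample(text: str) -> str:
    """Illustrative cleanup of one article or summary (hypothetical helper)."""
    # Note 3: str.split() with no arguments splits on any whitespace run,
    # including "\r" and "\n", so re-joining yields a single physical line.
    text = " ".join(text.split())
    # Note 4: drop the "(CNN)" marker wherever it appears.
    text = text.replace("(CNN)", "")
    # Note 1: remove spaces that precede a period.
    text = re.sub(r" +\.", ".", text)
    # Note 2: the text is deliberately NOT lowercased.
    # Collapse any whitespace left over from the removals above.
    return " ".join(text.split())


if __name__ == "__main__":
    raw = "(CNN) Hello World .\r\nSecond line ."
    print(clean_sample(raw))
```

Writing each cleaned sample followed by a single newline would then give files with one line per sample, which is the layout the README asks for in test.source and test.target.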


