[stas/sam] Newsroom dataset weirdness
See original GitHub issue

Get the data:
cd examples/seq2seq/
curl -L -o stas_data.tgz https://www.dropbox.com/sh/ctpx2pflb9nmt0n/AABRTDak-W06RD8KxuCOUdXla\?dl\=0 && unzip stas_data.tgz
tar -xzvf newsroom-test.tgz
from transformers import PegasusTokenizer
from utils import Seq2SeqDataset

tok = PegasusTokenizer.from_pretrained('google/pegasus-newsroom')
ds = Seq2SeqDataset(tok, 'newsroom/data', tok.model_max_length, tok.model_max_length, type_path='test')
ds[659]['tgt_texts']
# "Insomniac's Pasquale Rotella has gone from throwing illegal raves in warehouses to throwing the nation's most iconic dance music festival in Las Vegas' Electric Daisy Carnival. "
ds[660]
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-17-7fbeab38f815> in <module>
----> 1 ds[660]
~/transformers_fork/examples/seq2seq/utils.py in __getitem__(self, index)
248 tgt_line = linecache.getline(str(self.tgt_file), index).rstrip("\n")
249 assert source_line, f"empty source line for index {index}"
--> 250 assert tgt_line, f"empty tgt line for index {index}"
251 return {"tgt_texts": tgt_line, "src_texts": source_line, "id": index - 1}
252
AssertionError: empty tgt line for index 661
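Side note on why this surfaces as an assertion rather than an IndexError: `linecache.getline` returns an empty string for any out-of-range index, so a tgt file that is effectively shorter than the src file shows up exactly like this. A minimal illustration:

```python
import linecache
import os
import tempfile

# linecache.getline is 1-indexed and returns "" (it does not raise)
# for any line number past the end of the file.
path = os.path.join(tempfile.mkdtemp(), "two_lines.txt")
with open(path, "w") as f:
    f.write("first\nsecond\n")

print(repr(linecache.getline(path, 2)))   # 'second\n'
print(repr(linecache.getline(path, 99)))  # ''
```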
Clue:
In vim, the “Pasquale Rotella” line is 654 (off by 7; possibly another bug), but it is 659/660 in the ds.
Similarly, linecache disagrees with wc -l about file lengths:
import linecache
src_lns = linecache.getlines(str(ds.src_file))
tgt_lns = linecache.getlines(str(ds.tgt_file))
assert len(src_lns) == len(tgt_lns), f'{len(src_lns)} != {len(tgt_lns)}'
AssertionError: 108717 != 110412
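A minimal sketch of how such a disagreement can arise (this matches the \cM clue below, but the exact dataset contents are an assumption on my part): `linecache` opens files in text mode with universal newlines, so a stray carriage return (`\r`, which vim shows as ^M / \cM) counts as an extra line break, while `wc -l` counts only `\n` bytes.

```python
import linecache
import os
import tempfile

# One \r embedded mid-line, two real \n terminators.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write(b"line one\rstill line one?\nline two\n")

# linecache uses universal-newline text mode: \r is a line break too.
lines = linecache.getlines(path)
print(len(lines))  # 3 lines according to linecache

# wc -l counts only \n bytes.
with open(path, "rb") as f:
    print(f.read().count(b"\n"))  # 2 lines according to wc -l
```

With enough stray `\r` characters in the tgt file, linecache's count (110412) can exceed the `\n` count, and indexing past the "real" end of file yields the empty lines seen above.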
Issue Analytics
- Created 3 years ago
- Comments: 6 (6 by maintainers)
Oh, I know - it’s \cM (carriage return) characters. Let me take care of it.
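One possible cleanup sketch (an assumption on my part, not necessarily the fix applied upstream; `strip_carriage_returns` is a hypothetical helper): drop the stray `\r` bytes so linecache and wc -l agree on line counts.

```python
import linecache
import os
import tempfile
from pathlib import Path

def strip_carriage_returns(path):
    """Remove stray \r bytes in place. Assumes the \r characters are
    spurious; any text around them is joined into one line."""
    p = Path(path)
    p.write_bytes(p.read_bytes().replace(b"\r", b""))

# Demo on a throwaway file with one embedded carriage return.
demo = os.path.join(tempfile.mkdtemp(), "test.source")
Path(demo).write_bytes(b"good line\nbad\rline\n")
strip_carriage_returns(demo)
print(len(linecache.getlines(demo)))  # 2 lines, same as wc -l would report
```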