[stas/sam] Newsroom dataset weirdness

get data

cd examples/seq2seq/
curl -L -o stas_data.tgz https://www.dropbox.com/sh/ctpx2pflb9nmt0n/AABRTDak-W06RD8KxuCOUdXla\?dl\=0 && unzip stas_data.tgz
tar -xzvf newsroom-test.tgz
from transformers import PegasusTokenizer
from utils import Seq2SeqDataset
tok = PegasusTokenizer.from_pretrained('google/pegasus-newsroom')
ds = Seq2SeqDataset(tok, 'newsroom/data', tok.model_max_length, tok.model_max_length, type_path='test')
ds[659]['tgt_texts']
# "Insomniac's Pasquale Rotella has gone from throwing illegal raves in warehouses to throwing the nation's most iconic dance music festival in Las Vegas' Electric Daisy Carnival. "
ds[660]
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-17-7fbeab38f815> in <module>
----> 1 ds[660]

~/transformers_fork/examples/seq2seq/utils.py in __getitem__(self, index)
    248         tgt_line = linecache.getline(str(self.tgt_file), index).rstrip("\n")
    249         assert source_line, f"empty source line for index {index}"
--> 250         assert tgt_line, f"empty tgt line for index {index}"
    251         return {"tgt_texts": tgt_line, "src_texts": source_line, "id": index - 1}
    252 

AssertionError: empty tgt line for index 661

Clue:

In vim, the “Pasquale Rotella” line is 654 (off by 7/possible other bug), but it is 659/660 in the ds. Similarly, linecache disagrees with wc -l about file lengths.

import linecache
src_lns = linecache.getlines(str(ds.src_file))
tgt_lns = linecache.getlines(str(ds.tgt_file))
assert len(src_lns) == len(tgt_lns), f'{len(src_lns)} != {len(tgt_lns)}'
AssertionError: 108717 != 110412
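
A disagreement like this can be reproduced on a tiny file. The sketch below assumes the gap comes from line terminators other than "\n": as far as I can tell, CPython's linecache reads files in universal-newline text mode, so a lone "\r" also ends a line, while wc -l counts only "\n". The helper names and the demo file are hypothetical, not part of examples/seq2seq.

```python
import linecache
import os
import tempfile


def newline_count(path):
    # Count "\n" bytes only, which is what `wc -l` reports.
    with open(path, "rb") as f:
        return f.read().count(b"\n")


def linecache_count(path):
    # Drop any stale cache entry, then count lines the way linecache sees them.
    linecache.checkcache(path)
    return len(linecache.getlines(path))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.target")  # hypothetical demo file
    with open(demo_path, "wb") as f:
        # The second record carries two stray carriage returns.
        f.write(b"first summary\nsecond\r<n>\rsummary\nthird summary\n")
    raw_count = newline_count(demo_path)
    cached_count = linecache_count(demo_path)

print(raw_count, cached_count)
```

If the two numbers differ for ds.src_file or ds.tgt_file, every index past the first affected record points at the wrong line, which would also explain an empty-line assertion further down the file.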

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

stas00 commented, Oct 14, 2020

Oh, I know - it’s \cM characters. Let me take care of it.

Google is still the best company to work for, according to Fortune
.^M<n>^M<n>The Mountain View-based tech giant earned the top 
^^^^^^^^^^^^^
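
To see which records are affected, one option is to scan the raw bytes for carriage returns. This is a minimal sketch, not the script used in the issue; the function name and the demo file are hypothetical. Binary mode splits on "\n" only, so the reported numbers should line up with wc -l rather than with linecache.

```python
import os
import tempfile


def lines_with_carriage_returns(path):
    # Return (line_number, raw_bytes) for every "\n"-terminated line
    # that contains at least one carriage return.
    found = []
    with open(path, "rb") as f:
        for number, raw in enumerate(f, start=1):
            if b"\r" in raw:
                found.append((number, raw))
    return found


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.source")  # hypothetical demo file
    with open(demo_path, "wb") as f:
        f.write(
            b"clean line\n"
            b"another clean line\n"
            b".\r<n>\rThe Mountain View-based tech giant earned the top\n"
            b"last line\n"
        )
    hits = lines_with_carriage_returns(demo_path)

for number, raw in hits:
    print(number, raw)
```

Note that a file saved with Windows line endings ("\r\n") would flag every line, so this check is only informative for files that are supposed to use plain "\n".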

stas00 commented, Oct 14, 2020
            src = re.sub(r'[\r\n]+', '<n>', src)
            tgt = re.sub(r'[\r\n]+', '<n>', tgt)
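
The two substitutions above can be wrapped in a small helper to check their effect on a sample string. This is a sketch under the assumption that each record should end up on a single physical line, with "<n>" standing in for line breaks as in the sample above; the name flatten_newlines is hypothetical.

```python
import re


def flatten_newlines(text):
    # Collapse any run of carriage returns and/or newlines into one "<n>"
    # marker, so the record cannot be split across lines when written out.
    return re.sub(r"[\r\n]+", "<n>", text)


sample = "according to Fortune\n.\r\n\r\nThe Mountain View-based tech giant"
flat = flatten_newlines(sample)
print(flat)
```

Because the character class matches both "\r" and "\n", a mixed run such as "\r\n\r\n" becomes a single "<n>", and a lone "\r" is replaced as well.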