[stas/sam] Newsroom dataset weirdness
See original GitHub issue

Get the data:
cd examples/seq2seq/
curl -L -o stas_data.tgz https://www.dropbox.com/sh/ctpx2pflb9nmt0n/AABRTDak-W06RD8KxuCOUdXla\?dl\=0 && unzip stas_data.tgz
tar -xzvf newsroom-test.tgz
from transformers import PegasusTokenizer
from utils import Seq2SeqDataset

tok = PegasusTokenizer.from_pretrained('google/pegasus-newsroom')
ds = Seq2SeqDataset(tok, 'newsroom/data', tok.model_max_length, tok.model_max_length, type_path='test')
ds[659]['tgt_texts']
# "Insomniac's Pasquale Rotella has gone from throwing illegal raves in warehouses to throwing the nation's most iconic dance music festival in Las Vegas' Electric Daisy Carnival. "
ds[660]
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-17-7fbeab38f815> in <module>
----> 1 ds[660]
~/transformers_fork/examples/seq2seq/utils.py in __getitem__(self, index)
248 tgt_line = linecache.getline(str(self.tgt_file), index).rstrip("\n")
249 assert source_line, f"empty source line for index {index}"
--> 250 assert tgt_line, f"empty tgt line for index {index}"
251 return {"tgt_texts": tgt_line, "src_texts": source_line, "id": index - 1}
252
AssertionError: empty tgt line for index 661
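Side note on why this surfaces as an assertion rather than an IndexError: `linecache.getline` returns an empty string for any out-of-range index, so a tgt file that is effectively shorter than the src file shows up exactly like this. A minimal illustration:

```python
import linecache
import os
import tempfile

# linecache.getline is 1-indexed and returns "" (it does not raise)
# for any line number past the end of the file.
path = os.path.join(tempfile.mkdtemp(), "two_lines.txt")
with open(path, "w") as f:
    f.write("first\nsecond\n")

print(repr(linecache.getline(path, 2)))   # 'second\n'
print(repr(linecache.getline(path, 99)))  # ''
```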
Clue:
In vim, the “Pasquale Rotella” line is 654 (off by 7; possibly another bug), but it is 659/660 in the ds.
Similarly, linecache disagrees with wc -l about file lengths:
import linecache
src_lns = linecache.getlines(str(ds.src_file))
tgt_lns = linecache.getlines(str(ds.tgt_file))
assert len(src_lns) == len(tgt_lns), f'{len(src_lns)} != {len(tgt_lns)}'
AssertionError: 108717 != 110412
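A minimal sketch of how such a disagreement can arise (this matches the \cM clue below, but the exact dataset contents are an assumption on my part): `linecache` opens files in text mode with universal newlines, so a stray carriage return (`\r`, which vim shows as ^M / \cM) counts as an extra line break, while `wc -l` counts only `\n` bytes.

```python
import linecache
import os
import tempfile

# One \r embedded mid-line, two real \n terminators.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write(b"line one\rstill line one?\nline two\n")

# linecache uses universal-newline text mode: \r is a line break too.
lines = linecache.getlines(path)
print(len(lines))  # 3 lines according to linecache

# wc -l counts only \n bytes.
with open(path, "rb") as f:
    print(f.read().count(b"\n"))  # 2 lines according to wc -l
```

With enough stray `\r` characters in the tgt file, linecache's count (110412) can exceed the `\n` count, and indexing past the "real" end of file yields the empty lines seen above.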
Issue Analytics
- Created 3 years ago
- Comments: 6 (6 by maintainers)
Oh, I know - it’s \cM (carriage return) characters. Let me take care of it.
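One possible cleanup sketch (an assumption on my part, not necessarily the fix applied upstream; `strip_carriage_returns` is a hypothetical helper): drop the stray `\r` bytes so linecache and wc -l agree on line counts.

```python
import linecache
import os
import tempfile
from pathlib import Path

def strip_carriage_returns(path):
    """Remove stray \r bytes in place. Assumes the \r characters are
    spurious; any text around them is joined into one line."""
    p = Path(path)
    p.write_bytes(p.read_bytes().replace(b"\r", b""))

# Demo on a throwaway file with one embedded carriage return.
demo = os.path.join(tempfile.mkdtemp(), "test.source")
Path(demo).write_bytes(b"good line\nbad\rline\n")
strip_carriage_returns(demo)
print(len(linecache.getlines(demo)))  # 2 lines, same as wc -l would report
```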