
CNN/DM : data preprocessing

See original GitHub issue

The CNN/DM download link points to an already-preprocessed dataset.

How can we reproduce a similar dataset from the official .story files?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
donglixp commented, Nov 14, 2019

@donglixp , can you please provide details regarding how to run these 3 steps?

import os
from multiprocessing import Pool

from nltk.tokenize.treebank import TreebankWordDetokenizer
from pytorch_pretrained_bert import BertTokenizer  # or: from transformers import BertTokenizer
from toolz import partition_all
from tqdm import tqdm

# `args` (bert_model, do_lower_case, processes, output_dir, src_sep_token)
# is produced by the script's argparse setup, which is not shown here.


def process_detokenize(chunk):
    # Detokenize PTB-style tokens, then re-tokenize with the BERT wordpiece vocab.
    twd = TreebankWordDetokenizer()
    tokenizer = BertTokenizer.from_pretrained(
        args.bert_model, do_lower_case=args.do_lower_case)
    r_list = []
    for idx, line in chunk:
        line = line.strip().replace('``', '"').replace('\'\'', '"').replace('`','\'')
        s_list = [twd.detokenize(x.strip().split(
            ' '), convert_parentheses=True) for x in line.split('<S_SEP>')]
        tk_list = [tokenizer.tokenize(s) for s in s_list]
        r_list.append((idx, s_list, tk_list))
    return r_list
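The quote normalization and `<S_SEP>` splitting at the top of the loop can be isolated into a small standard-library-only sketch (the TreebankWordDetokenizer and BertTokenizer steps are omitted here, so this only illustrates the string handling):

```python
# Stand-in for the normalization step in process_detokenize above.
def normalize_and_split(line):
    # Map PTB-style quotes (``, '', `) back to plain ASCII quotes.
    line = line.strip().replace('``', '"').replace("''", '"').replace('`', "'")
    # Articles are stored one per line, with sentences joined by <S_SEP>.
    return [s.strip() for s in line.split('<S_SEP>')]

line = "he said , `` hello '' <S_SEP> she left ."
print(normalize_and_split(line))
# ['he said , " hello "', 'she left .']
```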


def read_tokenized_file(fn):
    with open(fn, 'r', encoding='utf-8') as f_in:
        l_list = [l for l in f_in]
    num_pool = min(args.processes, len(l_list))
    p = Pool(num_pool)
    chunk_list = partition_all(
        int(len(l_list)/num_pool), list(enumerate(l_list)))
    r_list = []
    with tqdm(total=len(l_list)) as pbar:
        for r in p.imap_unordered(process_detokenize, chunk_list):
            r_list.extend(r)
            pbar.update(len(r))
    p.close()
    p.join()
    r_list.sort(key=lambda x: x[0])
    return [x[1] for x in r_list], [x[2] for x in r_list]
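`read_tokenized_file` relies on two details that are easy to miss: chunks are handed to `imap_unordered`, which may return them in any order, and the trailing sort on the stored index restores the original line order. A pure-Python sketch (with a stdlib stand-in for `toolz.partition_all`, and the worker pool simulated by iterating chunks in reverse) shows the pattern:

```python
# Stdlib stand-in for toolz.partition_all: split a sequence into
# consecutive chunks of at most n items.
def partition_all(n, seq):
    seq = list(seq)
    return [seq[i:i + n] for i in range(0, len(seq), n)]

lines = ['a', 'b', 'c', 'd', 'e']
chunks = partition_all(2, list(enumerate(lines)))

# imap_unordered may yield chunks in any order; simulate with reversed():
results = []
for chunk in reversed(chunks):
    results.extend((idx, text.upper()) for idx, text in chunk)

results.sort(key=lambda x: x[0])  # restore original line order by index
print([text for _, text in results])
# ['A', 'B', 'C', 'D', 'E']
```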


def append_sep(s_list):
    r_list = []
    for i, s in enumerate(s_list):
        r_list.append(s)
        r_list.append('[SEP_{0}]'.format(min(9, i)))
    return r_list[:-1]
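`append_sep` interleaves position-indexed separator tokens `[SEP_0]` … `[SEP_9]` between source sentences, with positions past 9 clamped to `[SEP_9]`. A runnable usage sketch (repeating the function so the example is self-contained):

```python
def append_sep(s_list):
    r_list = []
    for i, s in enumerate(s_list):
        r_list.append(s)
        r_list.append('[SEP_{0}]'.format(min(9, i)))
    return r_list[:-1]  # drop the trailing separator

print(' '.join(append_sep(['first .', 'second .', 'third .'])))
# first . [SEP_0] second . [SEP_1] third .
```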


## print('convert into src/tgt format')
with open(os.path.join(args.output_dir, split_out + '.src'), 'w', encoding='utf-8') as f_src, \
        open(os.path.join(args.output_dir, split_out + '.tgt'), 'w', encoding='utf-8') as f_tgt, \
        open(os.path.join(args.output_dir, split_out + '.slv'), 'w', encoding='utf-8') as f_slv:
    for src, tgt, lb in tqdm(zip(article_tk, summary_tk, label)):
        # source
        src_tokenized = [' '.join(s) for s in src]
        if args.src_sep_token:
            f_src.write(' '.join(append_sep(src_tokenized)))
        else:
            f_src.write(' '.join(src_tokenized))
        f_src.write('\n')
        # target (silver)
        slv_tokenized = [s for s, extract_flag in zip(
            src_tokenized, lb) if extract_flag]
        f_slv.write(' [X_SEP] '.join(slv_tokenized))
        f_slv.write('\n')
        # target (gold)
        f_tgt.write(' [X_SEP] '.join(
            [' '.join(s) for s in tgt]))
        f_tgt.write('\n')
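The silver-target step above keeps only the source sentences whose extractive label flag is set and joins them with the `[X_SEP]` sentence separator. A minimal sketch of that selection:

```python
# Sketch of the silver-summary construction in the loop above:
# keep sentences with a truthy extractive label, join with [X_SEP].
def silver_summary(src_sentences, labels):
    picked = [s for s, flag in zip(src_sentences, labels) if flag]
    return ' [X_SEP] '.join(picked)

src = ['sent one .', 'sent two .', 'sent three .']
print(silver_summary(src, [1, 0, 1]))
# sent one . [X_SEP] sent three .
```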

The input file should already be sentence-split, with the sentences of each article joined by “<S_SEP>”.
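A hedged sketch of getting from the official .story files to that `<S_SEP>` format: a .story file contains the article paragraphs followed by `@highlight` markers, each introducing one summary sentence. Real CNN/DM pipelines use Stanford CoreNLP for sentence splitting and tokenization; `split_sentences` below is a naive stand-in for illustration only.

```python
import re

def split_sentences(text):
    # Naive sentence splitter (a stand-in for CoreNLP's ssplit).
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def parse_story(story_text):
    # .story layout: article lines, then "@highlight" lines, each
    # followed by one summary sentence.
    article_lines, highlights = [], []
    next_is_highlight = False
    for line in story_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == '@highlight':
            next_is_highlight = True
        elif next_is_highlight:
            highlights.append(line)
            next_is_highlight = False
        else:
            article_lines.append(line)
    article_sents = split_sentences(' '.join(article_lines))
    return ' <S_SEP> '.join(article_sents), ' <S_SEP> '.join(highlights)

story = "First fact. Second fact.\n\n@highlight\n\nOne-line summary"
src, tgt = parse_story(story)
print(src)  # First fact. <S_SEP> Second fact.
print(tgt)  # One-line summary
```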

0 reactions
ranjeetds commented, Nov 25, 2019

@tahmedge Did you use the above script? If yes, could you please share your implementation?

Read more comments on GitHub >

Top Results From Across the Web

cnn_dailymail · Datasets at Hugging Face
0 provided a non-anonymized version of the data, whereas both the previous versions were preprocessed to replace named entities with unique identifier labels....
cnndm.py
URLS = ["https://s3.amazonaws.com/opennmt-models/Summary/cnndm.tar.gz"] def _setup_datasets( url, top_n=-1, local_cache_path=".data", ...
CNN/Daily Mail Dataset - Papers With Code
CNN/Daily Mail is a dataset for text summarization. Human generated abstractive summary bullets were generated from news stories in CNN and Daily Mail ......
Summarization — OpenNMT-py documentation
cnndm.yaml ## Where the samples will be written save_data: cnndm/run/example ... True # Corpus opts: data: cnndm: path_src: cnndm/train.txt.src path_tgt: ...
cnn_dailymail | TensorFlow Datasets
CNN/DailyMail non-anonymized summarization dataset. There are two features: - article: text of news article, used as the document to be summarized - highlights: ......
