
CNN/DM : data preprocessing

See original GitHub issue

The CNN/DM download link points to an already-preprocessed dataset.

How can we reproduce a similar dataset from the official .story files?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
donglixp commented, Nov 14, 2019

@donglixp , can you please provide details regarding how to run these 3 steps?

import os
from multiprocessing import Pool

from nltk.tokenize.treebank import TreebankWordDetokenizer
from pytorch_pretrained_bert import BertTokenizer  # or: from transformers import BertTokenizer
from toolz import partition_all
from tqdm import tqdm

# `args` (bert_model, do_lower_case, processes, output_dir, src_sep_token)
# is produced by the script's argparse setup, which is not shown here.


def process_detokenize(chunk):
    # Detokenize PTB-style tokens, then re-tokenize with the BERT wordpiece vocab.
    twd = TreebankWordDetokenizer()
    tokenizer = BertTokenizer.from_pretrained(
        args.bert_model, do_lower_case=args.do_lower_case)
    r_list = []
    for idx, line in chunk:
        line = line.strip().replace('``', '"').replace('\'\'', '"').replace('`','\'')
        s_list = [twd.detokenize(x.strip().split(
            ' '), convert_parentheses=True) for x in line.split('<S_SEP>')]
        tk_list = [tokenizer.tokenize(s) for s in s_list]
        r_list.append((idx, s_list, tk_list))
    return r_list
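The quote normalization and `<S_SEP>` splitting at the top of the loop can be isolated into a small standard-library-only sketch (the TreebankWordDetokenizer and BertTokenizer steps are omitted here, so this only illustrates the string handling):

```python
# Stand-in for the normalization step in process_detokenize above.
def normalize_and_split(line):
    # Map PTB-style quotes (``, '', `) back to plain ASCII quotes.
    line = line.strip().replace('``', '"').replace("''", '"').replace('`', "'")
    # Articles are stored one per line, with sentences joined by <S_SEP>.
    return [s.strip() for s in line.split('<S_SEP>')]

line = "he said , `` hello '' <S_SEP> she left ."
print(normalize_and_split(line))
# ['he said , " hello "', 'she left .']
```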


def read_tokenized_file(fn):
    with open(fn, 'r', encoding='utf-8') as f_in:
        l_list = [l for l in f_in]
    num_pool = min(args.processes, len(l_list))
    p = Pool(num_pool)
    chunk_list = partition_all(
        int(len(l_list)/num_pool), list(enumerate(l_list)))
    r_list = []
    with tqdm(total=len(l_list)) as pbar:
        for r in p.imap_unordered(process_detokenize, chunk_list):
            r_list.extend(r)
            pbar.update(len(r))
    p.close()
    p.join()
    r_list.sort(key=lambda x: x[0])
    return [x[1] for x in r_list], [x[2] for x in r_list]
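`read_tokenized_file` relies on two details that are easy to miss: chunks are handed to `imap_unordered`, which may return them in any order, and the trailing sort on the stored index restores the original line order. A pure-Python sketch (with a stdlib stand-in for `toolz.partition_all`, and the worker pool simulated by iterating chunks in reverse) shows the pattern:

```python
# Stdlib stand-in for toolz.partition_all: split a sequence into
# consecutive chunks of at most n items.
def partition_all(n, seq):
    seq = list(seq)
    return [seq[i:i + n] for i in range(0, len(seq), n)]

lines = ['a', 'b', 'c', 'd', 'e']
chunks = partition_all(2, list(enumerate(lines)))

# imap_unordered may yield chunks in any order; simulate with reversed():
results = []
for chunk in reversed(chunks):
    results.extend((idx, text.upper()) for idx, text in chunk)

results.sort(key=lambda x: x[0])  # restore original line order by index
print([text for _, text in results])
# ['A', 'B', 'C', 'D', 'E']
```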


def append_sep(s_list):
    r_list = []
    for i, s in enumerate(s_list):
        r_list.append(s)
        r_list.append('[SEP_{0}]'.format(min(9, i)))
    return r_list[:-1]
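`append_sep` interleaves position-indexed separator tokens `[SEP_0]` … `[SEP_9]` between source sentences, with positions past 9 clamped to `[SEP_9]`. A runnable usage sketch (repeating the function so the example is self-contained):

```python
def append_sep(s_list):
    r_list = []
    for i, s in enumerate(s_list):
        r_list.append(s)
        r_list.append('[SEP_{0}]'.format(min(9, i)))
    return r_list[:-1]  # drop the trailing separator

print(' '.join(append_sep(['first .', 'second .', 'third .'])))
# first . [SEP_0] second . [SEP_1] third .
```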


## print('convert into src/tgt format')
with open(os.path.join(args.output_dir, split_out + '.src'), 'w', encoding='utf-8') as f_src, \
        open(os.path.join(args.output_dir, split_out + '.tgt'), 'w', encoding='utf-8') as f_tgt, \
        open(os.path.join(args.output_dir, split_out + '.slv'), 'w', encoding='utf-8') as f_slv:
    for src, tgt, lb in tqdm(zip(article_tk, summary_tk, label)):
        # source
        src_tokenized = [' '.join(s) for s in src]
        if args.src_sep_token:
            f_src.write(' '.join(append_sep(src_tokenized)))
        else:
            f_src.write(' '.join(src_tokenized))
        f_src.write('\n')
        # target (silver)
        slv_tokenized = [s for s, extract_flag in zip(
            src_tokenized, lb) if extract_flag]
        f_slv.write(' [X_SEP] '.join(slv_tokenized))
        f_slv.write('\n')
        # target (gold)
        f_tgt.write(' [X_SEP] '.join(
            [' '.join(s) for s in tgt]))
        f_tgt.write('\n')
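The silver-target step above keeps only the source sentences whose extractive label flag is set and joins them with the `[X_SEP]` sentence separator. A minimal sketch of that selection:

```python
# Sketch of the silver-summary construction in the loop above:
# keep sentences with a truthy extractive label, join with [X_SEP].
def silver_summary(src_sentences, labels):
    picked = [s for s, flag in zip(src_sentences, labels) if flag]
    return ' [X_SEP] '.join(picked)

src = ['sent one .', 'sent two .', 'sent three .']
print(silver_summary(src, [1, 0, 1]))
# sent one . [X_SEP] sent three .
```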

The input file should already be sentence-split, with the sentences of each article joined by “<S_SEP>”.
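A hedged sketch of getting from the official .story files to that `<S_SEP>` format: a .story file contains the article paragraphs followed by `@highlight` markers, each introducing one summary sentence. Real CNN/DM pipelines use Stanford CoreNLP for sentence splitting and tokenization; `split_sentences` below is a naive stand-in for illustration only.

```python
import re

def split_sentences(text):
    # Naive sentence splitter (a stand-in for CoreNLP's ssplit).
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def parse_story(story_text):
    # .story layout: article lines, then "@highlight" lines, each
    # followed by one summary sentence.
    article_lines, highlights = [], []
    next_is_highlight = False
    for line in story_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == '@highlight':
            next_is_highlight = True
        elif next_is_highlight:
            highlights.append(line)
            next_is_highlight = False
        else:
            article_lines.append(line)
    article_sents = split_sentences(' '.join(article_lines))
    return ' <S_SEP> '.join(article_sents), ' <S_SEP> '.join(highlights)

story = "First fact. Second fact.\n\n@highlight\n\nOne-line summary"
src, tgt = parse_story(story)
print(src)  # First fact. <S_SEP> Second fact.
print(tgt)  # One-line summary
```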

0 reactions
ranjeetds commented, Nov 25, 2019

@tahmedge Did you use the above script? If yes, could you please share your implementation?

Read more comments on GitHub >

Top Results From Across the Web

cnn_dailymail · Datasets at Hugging Face
0 provided a non-anonymized version of the data, whereas both the previous versions were preprocessed to replace named entities with unique identifier labels....
cnndm.py
URLS = ["https://s3.amazonaws.com/opennmt-models/Summary/cnndm.tar.gz"] def _setup_datasets( url, top_n=-1, local_cache_path=".data", ...
CNN/Daily Mail Dataset - Papers With Code
CNN/Daily Mail is a dataset for text summarization. Human generated abstractive summary bullets were generated from news stories in CNN and Daily Mail ......
Summarization — OpenNMT-py documentation
cnndm.yaml ## Where the samples will be written save_data: cnndm/run/example ... True # Corpus opts: data: cnndm: path_src: cnndm/train.txt.src path_tgt: ...
cnn_dailymail | TensorFlow Datasets
CNN/DailyMail non-anonymized summarization dataset. There are two features: - article: text of news article, used as the document to be summarized - highlights: ......
