
Translation datasets not automatically downloading


Code:


from torchtext.data import Field
from torchtext.datasets import Multi30k

DE = Field(init_token='<sos>', eos_token='<eos>')
EN = Field(init_token='<sos>', eos_token='<eos>')

train, val, test = Multi30k.splits(exts=('.de', '.en'), fields=(DE, EN))

Error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-3-637d49b65435> in <module>()
----> 1 train, val, test = Multi30k.splits(exts=('.de', '.en'), fields=(DE, EN))

~/miniconda3/envs/pytorch/lib/python3.6/site-packages/torchtext/datasets/translation.py in splits(cls, exts, fields, root, train, validation, test, **kwargs)
     99         """
    100         return super(Multi30k, cls).splits(
--> 101             exts, fields, root, train, validation, test, **kwargs)
    102 
    103 

~/miniconda3/envs/pytorch/lib/python3.6/site-packages/torchtext/datasets/translation.py in splits(cls, exts, fields, path, root, train, validation, test, **kwargs)
     62 
     63         train_data = None if train is None else cls(
---> 64             os.path.join(path, train), exts, fields, **kwargs)
     65         val_data = None if validation is None else cls(
     66             os.path.join(path, validation), exts, fields, **kwargs)

~/miniconda3/envs/pytorch/lib/python3.6/site-packages/torchtext/datasets/translation.py in __init__(self, path, exts, fields, **kwargs)
     31 
     32         examples = []
---> 33         with open(src_path) as src_file, open(trg_path) as trg_file:
     34             for src_line, trg_line in zip(src_file, trg_file):
     35                 src_line, trg_line = src_line.strip(), trg_line.strip()

FileNotFoundError: [Errno 2] No such file or directory: '.data/val.de'

It just doesn’t seem to download the data automatically for either the Multi30k or the WMT14 dataset.

PyTorch version: 0.3.1, TorchText version: 0.2.3
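As a quick sanity check (my addition, not from the original issue), you can confirm that nothing was fetched into the default download root:

import os

root = '.data'  # torchtext's default root, as seen in the traceback
print(sorted(os.listdir(root)) if os.path.isdir(root) else root + ' does not exist')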

EDIT

I have downgraded my TorchText to version 0.2.1 and I no longer get the error. I had a quick look at the commits between 0.2.1 and 0.2.3 but couldn’t figure out which commit introduced the break.
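For anyone wanting the same workaround, pinning the older release should reproduce the working behavior:

pip install torchtext==0.2.1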

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

5 reactions
n0obcoder commented, Feb 26, 2019

@mttk I figured out that I had to add the root argument in the split function, so I modified the line of code to train_data, test_data = datasets.IMDB.splits(TEXT, LABEL, root='data') # the data will be downloaded in the root dir. After that, the data got downloaded in the specified root directory. Thanks anyway 😄
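Cleaned up, that workaround looks something like the sketch below. TEXT and LABEL are assumed to be ordinary torchtext Fields, as in the standard IMDB example; they are not part of the original comment.

from torchtext import data, datasets

# assumed field definitions, not from the original comment
TEXT = data.Field()
LABEL = data.Field(sequential=False)

# passing root explicitly makes torchtext download into ./data
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL, root='data')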

2 reactions
aa1607 commented, Jul 18, 2018

I got around this quite easily by downloading with Multi30k.download(DATAROOT) and then using TranslationDataset.splits instead of Multi30k.splits. Pass the root path to the path argument instead of the root argument:

from torchtext.data import Field
from torchtext.datasets import TranslationDataset, Multi30k

ROOT = '~/Python/DATASETS/Multi30k/'
Multi30k.download(ROOT)

# srcfield/tgtfield were not defined in the original comment; these are
# minimal placeholder definitions
srcfield = Field(init_token='<s>', eos_token='</s>')
tgtfield = Field(init_token='<s>', eos_token='</s>')

(trnset, valset, testset) = TranslationDataset.splits(
    path=ROOT,
    exts=['.en', '.de'],
    fields=[('src', srcfield), ('trg', tgtfield)],
    test='test2016')
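For reference, TranslationDataset builds each split path as path + split name + extension, so this layout expects files like train.en/train.de, val.en/val.de, and test2016.en/test2016.de directly under ROOT. Depending on the torchtext version, Multi30k.download may place the files in a multi30k subdirectory under ROOT instead, in which case path should point there.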

I use this function (after downloading) to preprocess the data and get the iterators:

import spacy
from torchtext.data import BucketIterator, interleave_keys, Field
from torchtext.datasets import TranslationDataset
from onmt.inputters import OrderedIterator

def prep_torchtext_multi30k(
        dataroot='~/Python/DATASETS/Multi30k/',
        maxwords=12000,
        bsize=32,
        langs=['de', 'en'],
        exts=['.en', '.de'],
        ):
    # adapts the dataset loader from https://github.com/A-Jacobson/minimal-nmt

    # reuse spacy models cached on the function object; repeatedly loading
    # them can use a lot of memory
    try:
        de, en = [prep_torchtext_multi30k.nlp[lang] for lang in langs]
    except (AttributeError, KeyError):
        de, en = [spacy.load(lang, disable=['tagger', 'parser', 'ner'])
                  for lang in langs]
    prep_torchtext_multi30k.nlp = {'en': en, 'de': de}

    def tok_src(text): return [tok.text for tok in de.tokenizer(text) if not tok.is_space]
    def tok_tgt(text): return [tok.text for tok in en.tokenizer(text) if not tok.is_space]

    SRC = Field( tokenize = tok_src, init_token='<s>',  eos_token='</s>' )
    TGT = Field( tokenize = tok_tgt, init_token='<s>',  eos_token='</s>' )
    
    trnset, valset, testset = TranslationDataset.splits(   
                                      path       = dataroot,  
                                      exts       = exts,   
                                      fields     = [('src', SRC), ('trg',TGT)],
                                      train      = 'train', 
                                      validation = 'val', 
                                      test       = 'test2016')

    for (nm, field) in [('src', SRC), ('trg', TGT)]:
        trnsubset = getattr(trnset, nm)
        field.build_vocab(trnsubset, max_size=maxwords)

    # ONMT's OrderedIterator subclasses BucketIterator but is better at packing
    # batches together; torchtext's interleave_keys minimizes padding on both
    # the src and trg sides

    trniter, valiter, tstiter = OrderedIterator.splits(   
                                   datasets = [trnset, valset, testset], 
                                   batch_size = bsize, 
                                   sort_key = lambda ex: interleave_keys(len(ex.src), len(ex.trg)),
                                   device='cuda' )

    return (trnset, valset, testset), (trniter, valiter, tstiter), (SRC.vocab, TGT.vocab)
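A hypothetical call, assuming the Multi30k files sit under the same ROOT as above (note the function hard-codes device='cuda', so adjust that for CPU-only machines):

(trnset, valset, testset), (trniter, valiter, tstiter), (src_vocab, tgt_vocab) = \
    prep_torchtext_multi30k(dataroot='~/Python/DATASETS/Multi30k/')

# each batch carries src/trg tensors shaped (seq_len, batch_size)
batch = next(iter(trniter))
print(batch.src.shape, batch.trg.shape)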

Read more comments on GitHub.
