Can't download IWSLT dataset to Google Colab
I am experimenting with an implementation of the “Attention Is All You Need” paper (this is the implementation), and I am using Google Colab so that I can use the GPU. However, the code that downloads the dataset through torchtext fails with an error.
This is the code I used:
from torchtext import data, datasets
import spacy

spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

BOS_WORD = '<s>'
EOS_WORD = '</s>'
BLANK_WORD = "<blank>"
SRC = data.Field(tokenize=tokenize_de, pad_token=BLANK_WORD)
TGT = data.Field(tokenize=tokenize_en, init_token=BOS_WORD,
                 eos_token=EOS_WORD, pad_token=BLANK_WORD)

MAX_LEN = 100
train, val, test = datasets.IWSLT.splits(
    exts=('.de', '.en'), fields=(SRC, TGT),
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
        len(vars(x)['trg']) <= MAX_LEN)

MIN_FREQ = 2
SRC.build_vocab(train.src, min_freq=MIN_FREQ)
TGT.build_vocab(train.trg, min_freq=MIN_FREQ)
And this is the error I got:
OSError Traceback (most recent call last)
/usr/lib/python3.6/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
1644 try:
-> 1645 t = cls.taropen(name, mode, fileobj, **kwargs)
1646 except OSError:
12 frames
OSError: Not a gzipped file (b'<!')
During handling of the above exception, another exception occurred:
ReadError Traceback (most recent call last)
/usr/lib/python3.6/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
1647 fileobj.close()
1648 if mode == 'r':
-> 1649 raise ReadError("not a gzip file")
1650 raise
1651 except:
ReadError: not a gzip file
Is this a known issue? Or am I missing something?
Issue Analytics
- State:
- Created 3 years ago
- Comments:7 (3 by maintainers)
Top GitHub Comments
I ran into the same problem and have now fixed it. I think the dataset URL used by ‘datasets.IWSLT.splits’ is outdated, so it downloads an invalid ‘de-en.tgz’ file. I downloaded the IWSLT2016 dataset from here instead. After unzipping it, I found ‘de-en.tgz’ under ‘2016-01/2016-01/texts/de/en/’ and used it to replace the ‘de-en.tgz’ in ‘./data/iwslt/’. Finally, I ran the code again.
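The replacement step can be sketched in Python as below. This is only a rough sketch, not what torchtext does internally. The `b'<!'` in the traceback suggests the cached file starts with HTML-like markup rather than gzip data, so the sketch first checks for the two gzip magic bytes before copying. The helper names `looks_like_gzip` and `replace_archive` and both path arguments are hypothetical; the default cache directory is simply the one mentioned in this comment and may differ in your setup.

```python
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def replace_archive(manual_copy, cache_dir="./data/iwslt/"):
    """Copy a manually downloaded de-en.tgz over the cached one.

    Both arguments are assumptions for illustration; point them at
    wherever the files actually live in your environment.
    """
    manual_copy = Path(manual_copy)
    if not looks_like_gzip(manual_copy):
        raise ValueError(f"{manual_copy} is not a gzip file")
    target = Path(cache_dir) / "de-en.tgz"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(manual_copy, target)  # overwrites the broken download
    return target
```

For example, calling `replace_archive("2016-01/2016-01/texts/de/en/de-en.tgz")` from the directory where the archive was unpacked should put a valid file in place before the original code is run again. Running `looks_like_gzip` on the cached file alone is also a quick way to confirm whether the download is the problem.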
Fixed by https://github.com/pytorch/text/pull/1115