Can't download IWSLT dataset to Google Colab

I am experimenting with an implementation of the “Attention Is All You Need” paper (this is the implementation), and I am running it on Google Colab so that I can use the GPU. However, the code that downloads the dataset through torchtext fails with an error.

This is the code I used:

from torchtext import data, datasets

import spacy
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

BOS_WORD = '<s>'
EOS_WORD = '</s>'
BLANK_WORD = "<blank>"
SRC = data.Field(tokenize=tokenize_de, pad_token=BLANK_WORD)
TGT = data.Field(tokenize=tokenize_en, init_token = BOS_WORD, 
                 eos_token = EOS_WORD, pad_token=BLANK_WORD)

MAX_LEN = 100
train, val, test = datasets.IWSLT.splits(
    exts=('.de', '.en'), fields=(SRC, TGT), 
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and 
        len(vars(x)['trg']) <= MAX_LEN)
MIN_FREQ = 2
SRC.build_vocab(train.src, min_freq=MIN_FREQ)
TGT.build_vocab(train.trg, min_freq=MIN_FREQ)

And this is the error I got:

OSError                                   Traceback (most recent call last)
/usr/lib/python3.6/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1644         try:
-> 1645             t = cls.taropen(name, mode, fileobj, **kwargs)
   1646         except OSError:

12 frames
OSError: Not a gzipped file (b'<!')

During handling of the above exception, another exception occurred:

ReadError                                 Traceback (most recent call last)
/usr/lib/python3.6/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1647             fileobj.close()
   1648             if mode == 'r':
-> 1649                 raise ReadError("not a gzip file")
   1650             raise
   1651         except:

ReadError: not a gzip file

Is this a known issue? Or am I missing something?
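
One detail in the traceback may help narrow this down: the b'<!' in the first OSError is the first two bytes read from the downloaded file. A real gzip archive starts with the bytes 0x1f 0x8b, while "<!" is how an HTML page (for example "<!DOCTYPE html>") usually begins, so the download most likely returned a web page instead of the archive. Below is a minimal sketch for checking this. The helper name describe_download is hypothetical and not part of torchtext, and the file location in the usage comment is an assumption about where the download was saved.

```python
GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def describe_download(path):
    """Make a rough guess at what a downloaded file contains.

    Returns "gzip" if the file starts with the gzip magic number,
    "html" if it starts with "<!" (as an HTML page usually does),
    and "unknown" otherwise.
    """
    with open(path, "rb") as f:
        head = f.read(2)
    if head == GZIP_MAGIC:
        return "gzip"
    if head == b"<!":
        return "html"
    return "unknown"


# Example usage (the path is an assumption, adjust it to your setup):
# print(describe_download("./data/iwslt/de-en.tgz"))
```

If this reports "html", the problem is the download itself rather than the tokenizers or the Field definitions.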

Issue Analytics

Created 3 years ago
Comments: 7 (3 by maintainers)

Top GitHub Comments

5 reactions

Matrix-Jian commented, Sep 25, 2021

I ran into the same problem and have now fixed it. I believe the dataset URL used by ‘datasets.IWSLT.splits’ is outdated, so it downloads an invalid file named ‘de-en.tgz’. I therefore downloaded the IWSLT2016 dataset from here. After unzipping it, I found the file ‘de-en.tgz’ under the path ‘2016-01/2016-01/texts/de/en/’ and used it to replace the ‘de-en.tgz’ in ‘./data/iwslt/’. Finally, I ran the code again.
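
The replacement step described above can be sketched in Python roughly as follows. This is only an illustration: replace_archive is a hypothetical helper, not a torchtext function, and the default directory and file name are taken from the paths mentioned in this comment. It checks that the manually downloaded file really is a tar archive before copying it, so a second broken file is not put in place.

```python
import shutil
import tarfile
from pathlib import Path


def replace_archive(good_archive, target_dir="./data/iwslt", name="de-en.tgz"):
    """Copy a manually downloaded archive over the broken one.

    Raises ValueError if `good_archive` is not a readable tar archive.
    Returns the path of the file that was written.
    """
    good_archive = Path(good_archive)
    if not tarfile.is_tarfile(str(good_archive)):
        raise ValueError(f"{good_archive} is not a valid tar archive")

    target = Path(target_dir) / name
    # Create ./data/iwslt/ if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(str(good_archive), str(target))
    return target


# Example usage, assuming the IWSLT2016 download was unpacked into the
# current directory (the source path is the one quoted in the comment):
# replace_archive("2016-01/2016-01/texts/de/en/de-en.tgz")
```

After the copy, running the original code again should pick up the archive already present in ./data/iwslt/, as reported above.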

4 reactions

zhangguanheng66 commented, Dec 29, 2020

Fixed by https://github.com/pytorch/text/pull/1115
