Can't download IWSLT dataset to Google Colab


I am experimenting with an implementation of the “Attention is All You Need” paper (this is the implementation), and I am using Google Colab so that I can run it on a GPU. However, the code that downloads the dataset through torchtext fails with an error.

This is the code I used:

from torchtext import data, datasets

import spacy
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

BOS_WORD = '<s>'
EOS_WORD = '</s>'
BLANK_WORD = "<blank>"
SRC = data.Field(tokenize=tokenize_de, pad_token=BLANK_WORD)
TGT = data.Field(tokenize=tokenize_en, init_token = BOS_WORD, 
                 eos_token = EOS_WORD, pad_token=BLANK_WORD)

MAX_LEN = 100
train, val, test = datasets.IWSLT.splits(
    exts=('.de', '.en'), fields=(SRC, TGT), 
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and 
        len(vars(x)['trg']) <= MAX_LEN)
MIN_FREQ = 2
SRC.build_vocab(train.src, min_freq=MIN_FREQ)
TGT.build_vocab(train.trg, min_freq=MIN_FREQ)

And this is the error I got:

OSError                                   Traceback (most recent call last)
/usr/lib/python3.6/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1644         try:
-> 1645             t = cls.taropen(name, mode, fileobj, **kwargs)
   1646         except OSError:

12 frames
OSError: Not a gzipped file (b'<!')

During handling of the above exception, another exception occurred:

ReadError                                 Traceback (most recent call last)
/usr/lib/python3.6/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1647             fileobj.close()
   1648             if mode == 'r':
-> 1649                 raise ReadError("not a gzip file")
   1650             raise
   1651         except:

ReadError: not a gzip file

Is this a known issue? Or am I missing something?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

5 reactions
Matrix-Jian commented, Sep 25, 2021

I ran into the same problem and have now fixed it. I think the dataset URL used by ‘datasets.IWSLT.splits’ is outdated, so the download produces a broken ‘de-en.tgz’ file (the error shows it starts with ‘<!’, which suggests an HTML page rather than an archive). I downloaded the IWSLT2016 dataset from here instead. After unzipping it, I found ‘de-en.tgz’ under the path ‘2016-01/2016-01/texts/de/en/’ and used it to replace the ‘de-en.tgz’ in ‘./data/iwslt/’. Finally, I ran the code again.
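
The manual replacement described above could be scripted roughly as follows. This is only a sketch, not something from the original thread: the helper names `is_gzip` and `replace_cached_archive` are hypothetical, and the two paths are the ones quoted in the comment, which may differ on your machine. It checks the first two bytes of the cached file against the gzip magic number and only overwrites the file when that check fails.

```python
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def replace_cached_archive(manual_copy, cached):
    """Overwrite `cached` with `manual_copy` when `cached` is missing or not gzip.

    Hypothetical helper. Returns True if the cached file was replaced.
    """
    manual_copy, cached = Path(manual_copy), Path(cached)
    if cached.exists() and is_gzip(cached):
        return False  # the cached archive already looks valid
    if not is_gzip(manual_copy):
        raise ValueError(f"{manual_copy} is not a gzip file either")
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(manual_copy, cached)
    return True


if __name__ == "__main__":
    # Paths as quoted in the comment above; adjust them to your own layout.
    manual = Path("2016-01/2016-01/texts/de/en/de-en.tgz")
    cached = Path("./data/iwslt/de-en.tgz")
    if manual.exists():
        print("replaced:", replace_cached_archive(manual, cached))
```

Run it before calling `datasets.IWSLT.splits` again, so that the loader finds a valid archive already in place.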

