Unable to download IWSLT datasets


🐛 Bug

Describe the bug
Unable to download the IWSLT2016 or IWSLT2017 datasets.

To Reproduce
Steps to reproduce the behavior:

from torchtext.datasets import IWSLT2016
train, valid, test = IWSLT2016()
src, tgt = next(iter(train))

The same error occurs when trying to use IWSLT2017.

Expected behavior
The program returns the next src, tgt pair in the training data.

Screenshots
Full error logs are in this gist.

Environment
Included in gist above.

Additional context
No additional context.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 18 (12 by maintainers)

Top GitHub Comments

4 reactions
adzcai commented, Apr 7, 2022

As a temporary fix, I’m downloading the datasets manually via the links in the documentation.

Then you can put the downloaded .tgz file into the proper directory: ~/.torchtext/cache/IWSLT2016/ for 2016, and the corresponding directory for 2017.

Then torchtext will recognize the files and not download from GDrive.
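
The manual workaround above can be scripted. The following is a minimal sketch using only the Python standard library, not part of torchtext: the helper name place_archive_in_cache and its parameters are hypothetical, and the cache location is the one quoted in the comment, which may differ if torchtext is configured with another root.

```python
import shutil
from pathlib import Path

# Cache location as quoted in the comment above (an assumption about your setup).
DEFAULT_CACHE_ROOT = Path.home() / ".torchtext" / "cache"


def place_archive_in_cache(
    archive: Path,
    dataset_name: str = "IWSLT2016",
    cache_root: Path = DEFAULT_CACHE_ROOT,
) -> Path:
    """Copy a manually downloaded archive into <cache_root>/<dataset_name>/.

    Hypothetical helper: it only copies the file and does not validate
    that the archive is the one torchtext expects.
    """
    archive = Path(archive).expanduser()
    if not archive.is_file():
        raise FileNotFoundError(f"Archive not found: {archive}")

    target_dir = Path(cache_root) / dataset_name
    target_dir.mkdir(parents=True, exist_ok=True)

    # Keep the original file name unchanged.
    target = target_dir / archive.name
    shutil.copy2(archive, target)
    return target
```

Call it with the path of the downloaded .tgz file, and pass dataset_name="IWSLT2017" for the 2017 data. The archive should probably keep the file name it had when downloaded, since torchtext presumably looks it up by name.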

2 reactions
lolzballs commented, May 26, 2022

@Nayef211 thanks, it does sound like exactly what I’m observing with IWSLT.

But I tried what is suggested in #1735 with (note the order of end_caching here and in the original code):

from functools import partial

# _return_full_filepath, _filter_filename_fn and _clean_files_wrapper are
# assumed to be defined in the surrounding torchtext dataset module.
def _filter_clean_cache(cache_decompressed_dp, full_filepath, uncleaned_filename):
    # Cache the result on disk under full_filepath.
    cache_inner_decompressed_dp = cache_decompressed_dp.on_disk_cache(
        filepath_fn=partial(_return_full_filepath, full_filepath)
    )
    # Open the archive in binary mode and read its members.
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.open_files(mode="b").load_from_tar()
    # end_caching is called here, before filter and map.
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.filter(partial(_filter_filename_fn, uncleaned_filename))
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.map(partial(_clean_files_wrapper, full_filepath))
    return cache_inner_decompressed_dp

I still get the same behaviour: the inner load_from_tar() never gets iterated over.
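
One generic way to confirm that a lazy stage is never iterated is to wrap its input in a counting iterable and inspect the counter after running the pipeline. The sketch below is plain Python, not torchdata API: CountingIterable is a hypothetical helper and the generator expression merely stands in for a downstream stage.

```python
from typing import Any, Iterable, Iterator


class CountingIterable:
    """Wrap an iterable and count how many items are pulled through it."""

    def __init__(self, source: Iterable[Any]) -> None:
        self.source = source
        self.count = 0

    def __iter__(self) -> Iterator[Any]:
        for item in self.source:
            self.count += 1
            yield item


# Stand-in for an upstream stage (hypothetical file names).
probe = CountingIterable(["a.xml", "b.xml"])

# Building a lazy downstream stage pulls nothing from the probe yet.
pipeline = (name.upper() for name in probe)
count_before = probe.count

# Consuming the pipeline pulls every item through the probe.
results = list(pipeline)
count_after = probe.count
```

If the counter is still zero after the full pipeline has been consumed, the wrapped stage was skipped, which would match the behaviour described above.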


