Multi30K dataset link is broken

The link to the Multi30K dataset at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz is broken. It is referenced here: https://github.com/pytorch/text/blob/73bf4fa8cedc12d910ab76190e446bd2e47a8325/torchtext/datasets/multi30k.py#L16

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 12 (3 by maintainers)

Top GitHub Comments

6 reactions
neychev commented, Jun 15, 2022

I found a local copy of the dataset and uploaded it to GitHub (it's rather small). For now it is available via this link: https://github.com/neychev/small_DL_repo/tree/master/datasets/Multi30k

Just in case: all rights belong to the original authors of the dataset. This is only a temporary copy for convenience.

3 reactions
neychev commented, Jun 26, 2022

Thanks, @Nayef211, @rrmina!

I have no idea what exactly is wrong with the data. The files above were taken from ~/.torchtext/cache/Multi30k on one of my students' machines.

I tried simply renaming the archive (to match the name in the torchtext docs) and the files inside it, and changing the MD5 to the correct one, and it seems to work.

Combined with the approach suggested by @Nayef211, which is far more elegant, the final procedure is the following:

from torchtext.datasets import multi30k, Multi30k

# Update URLs to point to data stored by user
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

# Update the expected hash, since the user-hosted test archive differs from the test split of the original dataset
multi30k.MD5["test"] = "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36"

data_train = Multi30k(split='train')
data_val = Multi30k(split='valid')
data_test = Multi30k(split='test')

Test data has 1000 sentences, which seems correct.
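
To check a locally downloaded archive against the expected hash before overriding it, something like the sketch below may help. It assumes the value stored in `multi30k.MD5` is actually a SHA-256 digest, which the 64-hex-character string above suggests despite the name; the helper `sha256_of_file` and the local file path are hypothetical and not part of torchtext.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical local path; adjust to wherever the archive was saved.
archive = Path("mmt16_task1_test.tar.gz")
if archive.exists():
    print(sha256_of_file(archive))
```

If the printed digest matches the string assigned to `multi30k.MD5["test"]`, the override above should pass torchtext's hash check for that file.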
