Multi30K dataset link is broken
The link to the Multi30K dataset at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz
is broken: https://github.com/pytorch/text/blob/73bf4fa8cedc12d910ab76190e446bd2e47a8325/torchtext/datasets/multi30k.py#L16
Issue Analytics
- State:
- Created a year ago
- Comments:12 (3 by maintainers)
Top GitHub Comments
Found a local copy of the dataset and uploaded it to GitHub (it’s rather small). For now it is available via this link: https://github.com/neychev/small_DL_repo/tree/master/datasets/Multi30k
Just in case: all rights belong to the original authors of the dataset; this is only a temporary copy for convenience.
Thanks, @Nayef211, @rrmina!
No idea what exactly is wrong with the data; the files above were located in
~/.torchtext/cache/Multi30k
of one of my students. I’ve tried simply renaming the archive (to match the name in the torchtext docs) and the files in it, and changing the MD5 to the correct one, and it seems to work.
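For anyone who needs to work out the "correct" MD5 of a renamed or re-uploaded archive, here is a minimal sketch using only the standard library. The helper names `md5_of_file` and `md5_matches` are hypothetical and not part of torchtext; this only shows one way to compute the digest of a local file so it can be compared against whatever value the loader expects.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_hex):
    """Compare the file's MD5 with an expected hex string, ignoring case and spaces."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Calling `md5_of_file` on the archive in the cache directory (for example a file under `~/.torchtext/cache/Multi30k`; the exact file name is an assumption) gives the value to put in place of the stale checksum.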
Including the approach suggested by @Nayef211, which is way more elegant, the final algorithm should be the following:
Test data has 1000 sentences, which seems correct.
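A quick way to repeat that sanity check on a local copy is to count the lines in the extracted files, sketched below with the standard library only. It assumes the usual one-sentence-per-line layout; the function names and any file names you pass in (e.g. a German and an English test file) are hypothetical, not taken from torchtext.

```python
def count_sentences(path):
    """Count non-empty lines in a text file with one sentence per line."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def parallel_sizes(source_path, target_path):
    """Return (source_count, target_count); raise if the two sides differ."""
    source_count = count_sentences(source_path)
    target_count = count_sentences(target_path)
    if source_count != target_count:
        raise ValueError(
            f"Parallel files differ in size: {source_count} vs {target_count}"
        )
    return source_count, target_count
```

If the copy is intact, `parallel_sizes` on the two test files should report the same count for both languages, matching the 1000 sentences mentioned above.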