Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Multi30K dataset link is broken

See original GitHub issue

The link to Multi30K dataset at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz is broken: https://github.com/pytorch/text/blob/73bf4fa8cedc12d910ab76190e446bd2e47a8325/torchtext/datasets/multi30k.py#L16

Issue Analytics

State:
Created a year ago
Comments:12 (3 by maintainers)

Top GitHub Comments

6reactions

neychevcommented, Jun 15, 2022

Found a local copy of the dataset and uploaded it to github (it’s rather small). For now it is available via this link: https://github.com/neychev/small_DL_repo/tree/master/datasets/Multi30k

Just in case, all rights belong to the original authors of the dataset, this is only a temporal copy for convenience.

3reactions

neychevcommented, Jun 26, 2022

Thanks, @Nayef211, @rrmina !

No idea what’s exactly wrong with the data, the files above were located in ~/.torchtext/cache/Multi30k of one of my students.

I’ve tried to simply rename the archive (according to the name in torchtext docs) and files in it and change MD5 to the correct one and it seems to work.

Including the approach suggested by @Nayef211, which is way more elegant, the final algorithm should be the following:

from torchtext.datasets import multi30k, Multi30k

# Update URLs to point to data stored by user
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

# Update hash since there is a discrepancy between user hosted test split and that of the test split in the original dataset 
multi30k.MD5["test"] = "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36"

data_train = Multi30k(split='train')
data_val = Multi30k(split='valid')
data_test = Multi30k(split='test')

Test data has 1000 sentences, which seems correct.