__getitem__() not implemented?
See original GitHub issue
❓ Questions and Help
Description
For some reason, calling __getitem__() on the torchtext Multi30k dataset raises a NotImplementedError for me, even though the dataset downloads correctly and calling next(iter()) on it returns valid output. Can someone help me understand this? I need the method because I'm wrapping the dataset in a larger dataset class and will have to call __getitem__() explicitly to perform joint pre-processing with other dataset products.
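For context, here is a minimal sketch of the map-style vs. iterable-style distinction in torch.utils.data that likely explains this behavior (assuming a recent PyTorch, where the Dataset base class raises NotImplementedError on indexing); the class names are illustrative only:

from torch.utils.data import Dataset, IterableDataset

# Map-style: implements __getitem__ (and usually __len__), so ds[0] works.
class MapStyleExample(Dataset):
    def __init__(self, items):
        self.items = items
    def __len__(self):
        return len(self.items)
    def __getitem__(self, idx):
        return self.items[idx]

# Iterable-style: implements only __iter__; indexing falls through to the
# Dataset base class, which raises NotImplementedError.
class IterableStyleExample(IterableDataset):
    def __init__(self, items):
        self.items = items
    def __iter__(self):
        return iter(self.items)

ds = IterableStyleExample(['a', 'b'])
next(iter(ds))  # 'a' -- iteration works
ds[0]           # NotImplementedError, just like Multi30k here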
Sample
import torchtext

m30k = torchtext.datasets.Multi30k(root=r'.\Data', split='test', language_pair=('en', 'de'))
m30k.__getitem__(0)  # raises NotImplementedError
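One possible workaround is to materialize the split into a map-style dataset first. This is a sketch, assuming a torchtext version (0.9+) that ships torchtext.data.functional.to_map_style_dataset; note it drains the whole split into memory, which is fine at Multi30k's scale but not for very large corpora:

import torchtext
from torchtext.data.functional import to_map_style_dataset

m30k = torchtext.datasets.Multi30k(root=r'.\Data', split='test', language_pair=('en', 'de'))
m30k_map = to_map_style_dataset(m30k)  # wraps the iterator's items in an indexable container
m30k_map[0]  # returns the first (English, German) sentence pair instead of raising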
Issue Analytics
- State:
- Created 2 years ago
- Comments: 5 (4 by maintainers)
Top Results From Across the Web
How to solve error "Operator 'getitem' is not supported on this ...
case takes a list of when criteria as its first argument, so you need to bracket accordingly: sa.case([(Model.column == value, then_output)]).
Classes that implement __getitem__ are reported as incompatible ...
Bug Report: Mypy erroneously reports that a class that implements __getitem__ is not suitable to be used with enumerate.
I got NotImplementedError when trying to create __getitem ...
I'm trying to solve a Cat vs. Dog classification problem using PyTorch. So I started by creating a Dataset class using the following code: ...
HD58847: PROBLEM WITH GETITEM, IMPLEMENTED ... - IBM
Problem with GetItem, implemented several times (and not well implemented for the VBExtensions) on a GSMGeom. Local fix. Problem summary.
Operator 'getitem' is not supported on this expression ... - Reddit
SQLAlchemy 'case' won't work - NotImplementedError: Operator 'getitem' is not supported on this expression ... It's for a pilot logbook programme.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Can I understand the reasoning behind implementing torchtext datasets as iterable-style instead of map-style? Many significantly larger image datasets (such as ImageNet and CIFAR-10) are implemented as map-style in torchvision (indeed, loading the entire dataset into memory is not a requirement of map-style anyway), and I'm not sure why batch size would be element dependent in this case. Those are really the only two cases where convention seems to dictate that an iterable-style dataset be used.
Some datasets in torchtext are modestly sized, but others (like CC100 soon) are significantly larger, and iterable-style is the only realistic way to consume them. Additionally, datapipes in the PyTorch ecosystem prefer iterable-style, which enables slightly cleaner, more intent-revealing semantics at the dataset level (versus at the loader level).
Batch size can certainly be element dependent in NLP cases where you may want to form batches based on the length of examples (like max-token post-pad batching); see the sketch after these comments.
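To make the max-token point concrete, here is a minimal, hypothetical sketch of length-dependent batching over an iterable-style dataset; max_token_batches and its parameters are illustrative, not part of the torchtext API:

def max_token_batches(pairs, max_tokens=256):
    # Group (src, tgt) pairs into batches capped by a source-token budget,
    # so the number of examples per batch varies with example length --
    # one reason batch size can be element dependent in NLP.
    batch, used = [], 0
    for src, tgt in pairs:
        n = len(src.split())
        if batch and used + n > max_tokens:
            yield batch
            batch, used = [], 0
        batch.append((src, tgt))
        used += n
    if batch:
        yield batch

# Usage over an iterable-style dataset such as Multi30k:
# for batch in max_token_batches(m30k): ...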