Add bucket sampler
🚀 Feature
Motivation
The legacy BucketIterator was convenient because it could batch samples by length to minimize padding. However, it had many disadvantages because of its API and its non-conformance with other parts of the PyTorch Dataset/DataLoader ecosystem. It would be nice if torchtext supported the spirit of the BucketIterator by way of a Sampler.
Pitch
A sampler with the ability to specify a maximum bucket size should be added, similar to those in torchnlp and allennlp. It could be used with existing torchtext datasets by passing it as a kwarg to the PyTorch DataLoader, so that sampling minimizes padding. A sketch of one possible shape is shown below.
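For illustration only, here is a minimal sketch of what such a sampler could look like, loosely following the torchnlp/allennlp designs. The class name BucketBatchSampler and the lengths, batch_size, and bucket_size parameters are assumptions for this sketch, not an existing torchtext API:

```python
import random

from torch.utils.data import Sampler


class BucketBatchSampler(Sampler):
    """Yield batches of indices whose samples have similar lengths.

    Indices are shuffled, grouped into buckets of `bucket_size`, sorted by
    length within each bucket, and split into batches, so each batch
    contains samples of similar length and padding is minimized.
    """

    def __init__(self, lengths, batch_size, bucket_size, shuffle=True):
        if bucket_size % batch_size != 0:
            raise ValueError("bucket_size must be a multiple of batch_size")
        self.lengths = lengths
        self.batch_size = batch_size
        self.bucket_size = bucket_size
        self.shuffle = shuffle

    def __iter__(self):
        indices = list(range(len(self.lengths)))
        if self.shuffle:
            # Shuffle across buckets so batch composition varies per epoch.
            random.shuffle(indices)
        for start in range(0, len(indices), self.bucket_size):
            # Sort each bucket by length so adjacent samples pad similarly.
            bucket = sorted(indices[start:start + self.bucket_size],
                            key=lambda i: self.lengths[i])
            for b in range(0, len(bucket), self.batch_size):
                yield bucket[b:b + self.batch_size]

    def __len__(self):
        # ceil(num_samples / batch_size); exact because bucket_size is a
        # multiple of batch_size, so only the final bucket can be ragged.
        return -(-len(self.lengths) // self.batch_size)
```

It could then be plugged into the standard DataLoader via its batch_sampler kwarg (pad_collate here stands in for whatever padding collate function the dataset needs):

```python
loader = DataLoader(dataset,
                    batch_sampler=BucketBatchSampler(lengths, batch_size=32, bucket_size=3200),
                    collate_fn=pad_collate)
```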
Alternatives
Users who want this functionality need to implement their own samplers.
Additional context
The migration guide contains a prototype of this feature, though it is not a first-class part of the torchtext repo. A proposed implementation can be found here.
Top GitHub Comments
I’d definitely be interested in contributing!
Ah, I misread the code – this makes total sense. I agree that this will have more impact than shuffling within a batch. I still think it is good to give the user an option about whether there should be shuffling at all. Maybe just `shuffle: bool = True` by default?

Yes, exactly. See here. There's probably some error checking to add here (what if `lengths` is the empty list after filtering? 🙀), but otherwise seems OK.
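For concreteness, a minimal sketch of both points discussed in this thread, reusing the hypothetical BucketBatchSampler constructor assumed in the Pitch section above: a `shuffle: bool = True` keyword and a guard for an empty `lengths` list:

```python
from typing import List

from torch.utils.data import Sampler


class BucketBatchSampler(Sampler):
    def __init__(self, lengths: List[int], batch_size: int,
                 bucket_size: int, shuffle: bool = True):
        # Error check discussed above: fail loudly if filtering removed
        # every sample, rather than silently yielding zero batches.
        if not lengths:
            raise ValueError("`lengths` is empty after filtering; nothing to sample")
        self.lengths = lengths
        self.batch_size = batch_size
        self.bucket_size = bucket_size
        # Shuffling stays on by default but can be disabled, e.g. for
        # deterministic evaluation.
        self.shuffle = shuffle
```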