feat(BucketIterator): add num_workers
🚀 Feature
Add `num_workers` to `BucketIterator`.
Motivation
While torchtext is designed for text, it is also the best tool available for sequence data in general.
I have sequence data in the form of sign language poses, which falls under the category of language, and I want to batch it the same way I would batch text - sorted by length.
For sign language, the dataset needs to handle data augmentation (my specific use case), and current data augmentation libraries like imgaug and albumentations are slow (see https://github.com/aleju/imgaug/issues/635 and https://github.com/albumentations-team/albumentations/issues/554).
Therefore, being able to use `num_workers` to augment one or many batches from the iterator in parallel would be a great help: instead of waiting 10 minutes, I would wait 15 seconds (with 40 CPUs).
Pitch
Add `num_workers` to `BucketIterator`.
Use `num_workers` to distribute the `Dataset.__getitem__` calls across all workers when iterating the `BucketIterator`.
Alternatives
Writing my own implementation of a bucket iterator extending `DataLoader`.
OR
Using a `DataLoader` with a batch size of 1.
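The first alternative can be sketched without any torchtext machinery: a batch sampler that groups indices by sequence length, which a `torch.utils.data.DataLoader` can consume via its `batch_sampler` argument. This is an illustrative pure-Python sketch (the function name, the `bucket_mult` heuristic, and the assumption that per-example lengths are known up front are all mine, not from the issue):

```python
import random

def bucket_batches(lengths, batch_size, bucket_mult=100, shuffle=True):
    """Yield batches of indices whose examples have similar lengths.

    Sorts within large "buckets" of bucket_mult * batch_size examples,
    so each batch has near-uniform length and padding waste stays small.
    """
    indices = list(range(len(lengths)))
    if shuffle:
        random.shuffle(indices)
    bucket_size = batch_size * bucket_mult
    batches = []
    for i in range(0, len(indices), bucket_size):
        # Sort each bucket by length, then slice it into batches.
        bucket = sorted(indices[i:i + bucket_size], key=lambda j: lengths[j])
        batches.extend(bucket[b:b + batch_size]
                       for b in range(0, len(bucket), batch_size))
    if shuffle:
        random.shuffle(batches)  # shuffle batch order, not batch contents
    return batches
```

Something like `DataLoader(dataset, batch_sampler=bucket_batches(lengths, 32), num_workers=40)` would then distribute the `__getitem__` calls across workers while keeping length-sorted batches.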
Other Info
Confirmation that `BucketIterator` doesn't support `num_workers`: https://github.com/pytorch/text/issues/437
Issue Analytics
- State:
- Created: 4 years ago
- Reactions: 1
- Comments: 6 (3 by maintainers)
Top GitHub Comments
Perfect! That is very helpful, thank you.
So all I needed to do to make this work was write a collator and replace the iterator with a `DataLoader`. Loading before an epoch went down from 17 seconds to 1-3 seconds (on a 40-CPU server), so I'll consider this fixed 😃
@AmitMY The new dataset abstraction in https://github.com/pytorch/text/issues/664 and https://github.com/pytorch/text/pull/701 is compatible with `DataLoader`, with multiprocessing support.