feat(BucketIterator): add num_workers
🚀 Feature
Add `num_workers` to `BucketIterator`.
Motivation
While torchtext is designed for text, it is also the best tool available for sequence data in general.
I have sequence data in the form of sign language poses, which falls under the category of language, and I want to batch it the same way I would batch text - sorted by length.
For sign language, the dataset needs to handle data augmentation (my specific use case), and current data augmentation libraries like imgaug and albumentations are slow (see https://github.com/aleju/imgaug/issues/635 and https://github.com/albumentations-team/albumentations/issues/554).
Therefore, being able to use `num_workers` to augment one or many batches from the iterator in parallel would be a great help: instead of waiting 10 minutes, I would wait 15 seconds (with 40 CPUs).
Pitch
Add `num_workers` to `BucketIterator`.
Use `num_workers` to distribute the `Dataset.__getitem__` calls across all workers when iterating the `BucketIterator`.
Alternatives
Writing my own implementation of a bucket iterator extending `DataLoader`.
OR
Using a `DataLoader` with a batch size of 1.
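The first alternative can be sketched without any torchtext machinery: a batch sampler that groups indices by sequence length, which a `torch.utils.data.DataLoader` can consume via its `batch_sampler` argument. This is an illustrative pure-Python sketch (the function name, the `bucket_mult` heuristic, and the assumption that per-example lengths are known up front are all mine, not from the issue):

```python
import random

def bucket_batches(lengths, batch_size, bucket_mult=100, shuffle=True):
    """Yield batches of indices whose examples have similar lengths.

    Sorts within large "buckets" of bucket_mult * batch_size examples,
    so each batch has near-uniform length and padding waste stays small.
    """
    indices = list(range(len(lengths)))
    if shuffle:
        random.shuffle(indices)
    bucket_size = batch_size * bucket_mult
    batches = []
    for i in range(0, len(indices), bucket_size):
        # Sort each bucket by length, then slice it into batches.
        bucket = sorted(indices[i:i + bucket_size], key=lambda j: lengths[j])
        batches.extend(bucket[b:b + batch_size]
                       for b in range(0, len(bucket), batch_size))
    if shuffle:
        random.shuffle(batches)  # shuffle batch order, not batch contents
    return batches
```

Something like `DataLoader(dataset, batch_sampler=bucket_batches(lengths, 32), num_workers=40)` would then distribute the `__getitem__` calls across workers while keeping length-sorted batches.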
Other Info
Confirmation that `BucketIterator` doesn't support `num_workers`: https://github.com/pytorch/text/issues/437
Issue Analytics
- State:
- Created: 4 years ago
- Reactions: 1
- Comments: 6 (3 by maintainers)
Top GitHub Comments
Perfect! That is very helpful, thank you.
So all I needed to do to make this work was write a collator and replace the iterator with a `DataLoader`. Loading before an epoch went down from 17 seconds to 1-3 seconds (on a 40-CPU server), so I'll consider this fixed 😃
@AmitMY The new dataset abstraction in https://github.com/pytorch/text/issues/664 and https://github.com/pytorch/text/pull/701 is compatible with `DataLoader`, with multiprocessing support.