feat(BucketIterator): add num_workers

🚀 Feature

Add num_workers to BucketIterator

Motivation

While torchtext is designed for text, it is also well suited to other kinds of sequence data. I have sequence data in the form of sign language poses, which falls under the umbrella of language, and I want to batch it the same way I would batch text: sorted by length.

For sign language, the dataset needs to handle data augmentation (my specific use case), and current data augmentation libraries like imgaug and albumentations are slow (see https://github.com/aleju/imgaug/issues/635 and https://github.com/albumentations-team/albumentations/issues/554). Using num_workers to augment one or more batches from the iterator in parallel would therefore be a great help: instead of waiting 10 minutes, I would wait about 15 seconds on a 40-CPU machine.

Pitch

Add num_workers to BucketIterator, and use it to distribute the Dataset.__getitem__ calls across all workers when iterating the BucketIterator.
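
For illustration, a hypothetical sketch of how the proposed argument might look. BucketIterator has no num_workers parameter today, so the argument and its semantics below describe the requested behaviour, not existing API:

from torchtext.data import BucketIterator

# Hypothetical: num_workers does not exist on BucketIterator; this only
# illustrates the interface being requested in this issue.
train_iter = BucketIterator(
    train_dataset,
    batch_size=32,
    sort_key=lambda ex: len(ex.text),  # bucket examples of similar length together
    num_workers=4,                     # proposed: run Dataset.__getitem__ in worker processes
)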

Alternatives

Writing my own implementation of a bucket iterator extending DataLoader (see the sketch after this list).

OR

Using a DataLoader with a batch size of 1.
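
A minimal sketch of the first alternative, assuming a plain map-style dataset and a padding collate function; bucketed_batches, lengths and pad_collate are illustrative names, not torchtext or PyTorch API:

import random

def bucketed_batches(lengths, batch_size):
    # Sort example indices by length, chunk them into fixed-size batches,
    # then shuffle the batch order so training still sees varied buckets.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.shuffle(batches)
    return batches

# train_iter = DataLoader(train_dataset,
#                         batch_sampler=bucketed_batches(lengths, 32),
#                         collate_fn=pad_collate, num_workers=4)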

Other Info

Confirmation that BucketIterator doesn’t support num_workers: https://github.com/pytorch/text/issues/437

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
AmitMY commented, Apr 13, 2020

Perfect! That is very helpful, thank you.

So all I needed to do to make this work was write a collate function:

from collections import defaultdict

from torchtext.data import Dataset


def text_data_collator(dataset: Dataset):
    def collate(data):
        # Group raw attribute values by field name across the examples in this batch.
        batch = defaultdict(list)

        for datum in data:
            for name, field in dataset.fields.items():
                batch[name].append(field.preprocess(getattr(datum, name)))

        # Let each field pad/numericalize its own column into a tensor.
        batch = {name: field.process(batch[name]) for name, field in dataset.fields.items()}

        return batch

    return collate

and replace the iterator with a DataLoader:

import multiprocessing
from torch.utils.data import DataLoader

collate = text_data_collator(train_dataset)
num_workers = multiprocessing.cpu_count()

train_iter = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate, num_workers=num_workers, shuffle=True)
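
Each batch coming out of this loader is a plain dict keyed by field name; the field names below (src, trg) are illustrative, not part of the snippet above:

# Each batch is a dict of processed tensors keyed by the dataset's field names.
for batch in train_iter:
    src = batch["src"]  # illustrative field name
    trg = batch["trg"]  # illustrative field name
    break  # just peek at the first batch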

Loading before an epoch went down from 17 seconds to 1-3 (on a 40-CPU server), so I’ll consider this fixed 😃

0 reactions
zhangguanheng66 commented, Apr 13, 2020

@AmitMY The new dataset abstractions in https://github.com/pytorch/text/issues/664 and https://github.com/pytorch/text/pull/701 are compatible with DataLoader and its multiprocessing support.
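
A minimal sketch of that pattern, assuming a map-style dataset whose items are already tensors of token ids; the toy dataset and pad_collate below are illustrative, not the exact API from those PRs:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # batch is a list of 1-D LongTensors of token ids; pad them to a common length.
    return pad_sequence(batch, batch_first=True, padding_value=0)

# Any map-style dataset of tensors works with DataLoader's worker processes.
toy_dataset = [torch.randint(0, 100, (n,)) for n in (5, 7, 3, 9)]
loader = DataLoader(toy_dataset, batch_size=2, collate_fn=pad_collate, num_workers=2)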
