Port `RASampler` functionality to iterable datasets

Our current classification references use the “Repeated Augment Sampler” (RASampler) from #5051:

https://github.com/pytorch/vision/blob/d585f86d94f07a3bc083e48c6534d93a409cbcb2/references/classification/sampler.py#L7

After the revamp, our datasets will be iterable-style rather than map-style, so samplers will no longer be supported.

Given that the RASampler increases accuracy, we need to support the same functionality going forward. It can probably be achieved by appending a custom RepeatedAugmentIterDataPipe to the dataset graph, but we need to make sure it works correctly with the shuffling and sharding.
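To make the idea concrete, here is a minimal sketch of the repeat-then-shard step in plain Python. It is not the torchdata API and not the existing `RASampler` code: the function name `repeated_augment` and its parameters (`rank`, `num_replicas`, `repetitions`, `max_samples`) are hypothetical, chosen only for illustration. It assumes that every rank iterates the same upstream stream in the same order (for example, shuffled with a shared seed) and that no other sharding has been applied before this step.

```python
import itertools
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def repeated_augment(
    source: Iterable[T],
    *,
    rank: int,
    num_replicas: int,
    repetitions: int = 3,
    max_samples: Optional[int] = None,
) -> Iterator[T]:
    """Return this rank's share of a stream in which every sample is repeated.

    Hypothetical helper: assumes `source` yields the same order on every rank.
    """
    # Repeat each upstream sample `repetitions` times, back to back.
    repeated = (sample for sample in source for _ in range(repetitions))
    # Round-robin sharding: position i of the repeated stream goes to rank
    # i % num_replicas, so copies of one sample are spread across ranks.
    sharded = itertools.islice(repeated, rank, None, num_replicas)
    # Optionally cap the number of samples this rank sees per epoch.
    return itertools.islice(sharded, max_samples)


# Two ranks, three repetitions, four upstream samples.
rank0 = list(repeated_augment(range(4), rank=0, num_replicas=2))
rank1 = list(repeated_augment(range(4), rank=1, num_replicas=2))
print(rank0)  # [0, 0, 1, 2, 2, 3]
print(rank1)  # [0, 1, 1, 2, 3, 3]
```

Random augmentations would be applied after this step, so that each copy of a sample is transformed independently. Without `max_samples`, each rank sees `repetitions` times more samples per epoch than with plain sharding; a cap is one way to keep the epoch length unchanged, at the cost of dropping part of the stream.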

cc @pmeier @YosuaMichael @datumbox @vfdev-5 @bjuncek

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
pmeier commented, May 16, 2022

One thing that came to mind while thinking about this is that we will waste a fair amount of I/O when using this functionality. The prototype datasets read the raw bytes and store them in a tensor for later decoding. Since this technique throws away the majority of the samples, all of those reads are wasted. I don’t know whether that is significant, since we only read and do not decode, but we should watch out for it.

0 reactions
datumbox commented, May 16, 2022

I see what you mean. You mean that we decode the same thing over and over. That’s expected, and I agree it was already happening in the previous implementation.

For a moment I thought you meant that the new API would require us to decode the part of the dataset that we throw away, which would obviously be a problem. Sorry for the confusion.


