Port `RASampler` functionality to iterable datasets
Our current classification references use the "Repeated Augment Sampler" (`RASampler`) from #5051. Since after the revamp we will have iterable- rather than map-style datasets, samplers are no longer supported. Given that the `RASampler` increases accuracy, we need to support the same functionality going forward. It can probably be achieved by appending a custom `RepeatedAugmentIterDataPipe` to the dataset graph, but we need to make sure it works correctly with shuffling and sharding.
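A minimal sketch of what such a datapipe could look like (the class body and its placement in the graph are assumptions on my part, not an existing implementation):

```python
from torchdata.datapipes.iter import IterDataPipe


class RepeatedAugmentIterDataPipe(IterDataPipe):
    """Yield every sample of the source pipe ``num_repeats`` times in a row.

    Placed between shuffling and sharding, the copies of a sample end up on
    different workers/ranks, each of which applies its own random
    augmentation -- the core idea behind the current ``RASampler``.
    """

    def __init__(self, source_datapipe: IterDataPipe, num_repeats: int = 3) -> None:
        self.source_datapipe = source_datapipe
        self.num_repeats = num_repeats

    def __iter__(self):
        for sample in self.source_datapipe:
            for _ in range(self.num_repeats):
                yield sample
```

Roughly, the intended position in the graph would be something like the following (again an assumption; `dataset_datapipe` stands in for the datapipe produced by the prototype dataset, and, like `RASampler`, the epoch would additionally have to be truncated to the original dataset length to keep the number of optimization steps unchanged):

```python
dp = dataset_datapipe.shuffle()                       # shuffle before repeating
dp = RepeatedAugmentIterDataPipe(dp, num_repeats=3)
dp = dp.sharding_filter()                             # copies land on different shards
```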
One thing that came to mind while thinking about this is that we will be wasting quite a lot of I/O when using this functionality. The prototype datasets read the raw bytes and store them in a tensor for later decoding. Since this technique throws away the majority of the samples, all of those reads are wasted. I don't know whether that is significant, since we only read and don't decode, but we should watch out for it.
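To make the concern concrete, a back-of-the-envelope sketch (the per-epoch truncation to roughly the original dataset length mirrors what `RASampler` does today; the dataset size is just an example):

```python
# RASampler keeps the epoch at ~len(dataset) samples while emitting each
# selected sample num_repeats times, so only ~len(dataset) / num_repeats
# unique samples are used per epoch. In a sequentially read stream, the
# raw bytes of the remaining samples are still read and then discarded.
num_repeats = 3
dataset_size = 1_281_167  # ImageNet-1k train split, as an example

unique_per_epoch = dataset_size // num_repeats
wasted_fraction = 1 - unique_per_epoch / dataset_size
print(f"reads discarded per epoch: {wasted_fraction:.0%}")  # ~67%
```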
I see what you mean: we decode the same thing over and over. That's expected, and I agree it was happening in the previous implementation as well.
For a moment I thought you meant that the new API would require us to decode the part of the dataset that we throw away, which would obviously be a problem. Sorry for the confusion.