Port `RASampler` functionality to iterable datasets
Our current classification references use the "Repeated Augment Sampler" (`RASampler`) from #5051. Since after the revamp we will have iterable- rather than map-style datasets, samplers are no longer supported. Given that the `RASampler` increases accuracy, we need to support the same functionality going forward. It can probably be achieved by appending a custom `RepeatedAugmentIterDataPipe` to the dataset graph, but we need to make sure it works correctly with shuffling and sharding.
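A minimal sketch of what such a datapipe could look like (the class body and its placement in the graph are assumptions on my part, not an existing implementation):

```python
from torchdata.datapipes.iter import IterDataPipe


class RepeatedAugmentIterDataPipe(IterDataPipe):
    """Yield every sample of the source pipe ``num_repeats`` times in a row.

    Placed between shuffling and sharding, the copies of a sample end up on
    different workers/ranks, each of which applies its own random
    augmentation -- the core idea behind the current ``RASampler``.
    """

    def __init__(self, source_datapipe: IterDataPipe, num_repeats: int = 3) -> None:
        self.source_datapipe = source_datapipe
        self.num_repeats = num_repeats

    def __iter__(self):
        for sample in self.source_datapipe:
            for _ in range(self.num_repeats):
                yield sample
```

Roughly, the intended position in the graph would be something like the following (again an assumption; `dataset_datapipe` stands in for the datapipe produced by the prototype dataset, and, like `RASampler`, the epoch would additionally have to be truncated to the original dataset length to keep the number of optimization steps unchanged):

```python
dp = dataset_datapipe.shuffle()                       # shuffle before repeating
dp = RepeatedAugmentIterDataPipe(dp, num_repeats=3)
dp = dp.sharding_filter()                             # copies land on different shards
```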
One thing that came to mind while thinking about this is that we will be wasting quite a lot of I/O when using this functionality. The prototype datasets read the raw bytes and store them in a tensor for later decoding. Since this technique throws away the majority of the samples, all of those reads are wasted. I don't know whether that is significant, since we only read and don't decode, but we should watch out for it.
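To make the concern concrete, a back-of-the-envelope sketch (the per-epoch truncation to roughly the original dataset length mirrors what `RASampler` does today; the dataset size is just an example):

```python
# RASampler keeps the epoch at ~len(dataset) samples while emitting each
# selected sample num_repeats times, so only ~len(dataset) / num_repeats
# unique samples are used per epoch. In a sequentially read stream, the
# raw bytes of the remaining samples are still read and then discarded.
num_repeats = 3
dataset_size = 1_281_167  # ImageNet-1k train split, as an example

unique_per_epoch = dataset_size // num_repeats
wasted_fraction = 1 - unique_per_epoch / dataset_size
print(f"reads discarded per epoch: {wasted_fraction:.0%}")  # ~67%
```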
I see what you mean: we decode the same thing over and over. That's expected, and I agree it was happening in the previous implementation as well.
For a moment I thought you meant that the new API would require us to decode the part of the dataset that we throw away, which would obviously be a problem. Sorry for the confusion.