[Feature request] Compatibility with iterable-style datasets
🚀 Feature
I’d like to be able to train on iterable-style datasets instead of just map-style datasets.
(a map-style dataset in PyTorch implements __getitem__ and __len__, whereas an iterable-style dataset only implements __iter__)
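To make the distinction concrete, here is a minimal sketch of the two protocols in plain PyTorch (the class names are illustrative, not from any library):

```python
from torch.utils.data import Dataset, IterableDataset

class MapStyle(Dataset):
    """Map-style: random access via an index, plus a known length."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)

class IterStyle(IterableDataset):
    """Iterable-style: samples are produced one at a time; no __len__
    or __getitem__ is required, so the total size can be unknown."""
    def __init__(self, source):
        self.source = source

    def __iter__(self):
        # In practice this might read from a tar shard, a database
        # cursor, or a network stream.
        yield from self.source

map_ds = MapStyle([10, 20, 30])
iter_ds = IterStyle(range(3))
assert map_ds[1] == 20 and len(map_ds) == 3
assert list(iter_ds) == [0, 1, 2]
```

Because an iterable-style dataset has no index, DataLoader samplers and shuffling work differently for it, which is part of why library support has to be explicit.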
Motivation
Many image datasets in commercial use cases are very large, and therefore require iterable-style rather than map-style loading. (Users may create custom iterable datasets, or use torchdata, webdataset, DALI, etc.)
Pitch
Vision tasks seem to require iterating over the entire dataset and building records prior to training (e.g. ObjectDetectionData). This does not make sense as a required step for large datasets. Say, for example, you want to compare models on a dataset of 10M images. Requiring a full pass over this dataset, potentially taking several hours before training starts, is an unnecessary and costly step. Users should be able to begin training immediately and have each sample from an iterable dataset provide the necessary information.
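The pattern being requested can be sketched as follows. This is a hypothetical streaming detection dataset, not a Flash API: each record is parsed and yielded on the fly, so training can start without first materializing all 10M records in memory.

```python
import json
import torch
from torch.utils.data import IterableDataset

class StreamingDetectionDataset(IterableDataset):
    """Yields (image, target) pairs lazily from line-delimited JSON
    annotations, instead of building every record before training."""
    def __init__(self, annotation_lines):
        # Any iterable of JSON strings: a file handle, a shard reader, etc.
        self.annotation_lines = annotation_lines

    def __iter__(self):
        for line in self.annotation_lines:
            record = json.loads(line)
            # A real pipeline would decode the image here (e.g. from a
            # tar shard); a dummy tensor stands in for the pixels.
            image = torch.zeros(3, 4, 4)
            target = {
                "boxes": torch.tensor(record["boxes"], dtype=torch.float32),
                "labels": torch.tensor(record["labels"], dtype=torch.int64),
            }
            yield image, target

lines = ['{"boxes": [[0, 0, 2, 2]], "labels": [1]}']
image, target = next(iter(StreamingDetectionDataset(lines)))
```

For object detection, the targets are variable-sized, so such a dataset would typically be paired with a collate function that batches images and targets as lists; that collate function is exactly what the Flash model currently has to provide, as the maintainer notes below.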
In my opinion, the lack of this capability prevents adoption of this library’s vision tasks for large-scale image training in commercial settings.
Additional context
lightning-bolts object detectors seem to support this style of dataset already.
Links: https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/ https://github.com/pytorch/data
Issue Analytics
- Created 2 years ago
- Comments:6 (2 by maintainers)

Hi @austinmw, thanks for your request! This is a current limitation of certain tasks in Flash: they cannot be directly used with your own datamodule because the model needs to provide the collate function for the data. IceVision models are more complex still, in that they need to provide the dataloader in full. I think it should be possible for us to find a workaround there, as this would be a great use case to support 😃
Awesome news, can’t wait to see, thanks!