[Feature request] Compatibility with iterable-style datasets

🚀 Feature

I’d like to be able to train on iterable-style datasets instead of just map-style datasets. (A map-style dataset in PyTorch implements __getitem__ and __len__, whereas an iterable-style dataset only implements __iter__.)
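To make the distinction concrete, here is a minimal sketch of the two protocols in plain Python. The class names and toy data are hypothetical. In PyTorch the first class would typically subclass torch.utils.data.Dataset and the second torch.utils.data.IterableDataset; those base classes are left out here only so the snippet runs without dependencies.

```python
class MapStyleSquares:
    """Map-style: random access by index, with a known length."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index * index


class IterableStyleSquares:
    """Iterable-style: sequential access only, no __len__ or __getitem__."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # Samples are produced one at a time, in stream order.
        for i in range(self.n):
            yield i * i


map_ds = MapStyleSquares(4)
iter_ds = IterableStyleSquares(4)
```

A map-style dataset can be indexed and shuffled by a sampler, while an iterable-style dataset can only be consumed in order, which is what makes it a fit for data that is streamed rather than indexed.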

Motivation

Many image datasets in commercial use cases are very large and therefore call for an iterable-style dataset rather than a map-style one. (Users may create custom iterable datasets, or use torchdata, webdataset, DALI, etc.)

Pitch

Vision tasks seem to require iterating over the entire dataset and building records before training starts (e.g. ObjectDetectionData). This does not make sense as a mandatory step for large datasets. Say, for example, you want to compare models on a dataset of 10M images. Iterating over that dataset for potentially several hours before training can begin is an unnecessary and costly step. Users should be able to begin training online, with each sample from an iterable dataset providing the necessary information as it is read.
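The cost difference can be illustrated with a small sketch. Everything here is hypothetical (read_samples, build_record, and the record layout are stand-ins, not Flash or IceVision code); the point is only how many samples must be read before the first record is available.

```python
def read_samples(num_samples, log):
    """Hypothetical stand-in for a large image source; logs each read."""
    for i in range(num_samples):
        log.append(i)
        yield {"image_id": i, "boxes": [(0, 0, 10, 10)]}


def build_record(sample):
    """Hypothetical per-sample record construction."""
    return {"id": sample["image_id"], "num_boxes": len(sample["boxes"])}


def eager_records(num_samples, log):
    # Visits every sample before anything can be returned.
    return [build_record(s) for s in read_samples(num_samples, log)]


def streaming_records(num_samples, log):
    # Builds each record only when the training loop asks for it.
    for s in read_samples(num_samples, log):
        yield build_record(s)


eager_log, lazy_log = [], []
first_eager = eager_records(1000, eager_log)[0]
first_lazy = next(streaming_records(1000, lazy_log))
```

In the eager version the whole source is read before the first record can be used; in the streaming version only one sample has been read at that point, and both produce the same first record.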

In my opinion, the lack of this capability prevents adoption of this library's vision tasks for large-scale image training in commercial settings.

Additional context

lightning-bolts object detectors seem to support this style of dataset already.

Links: https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/ https://github.com/pytorch/data

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
ethanwharris commented, Mar 21, 2022

Hi @austinmw, thanks for your request! This is a current limitation of certain tasks in Flash: they cannot be used directly with your own datamodule because the model needs to provide the collate function for the data. IceVision models are more complex still, in that they need to provide the dataloader in full. I think it should be possible for us to find a workaround there, as this would be a great use case to support 😃
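As an illustration of the coupling described in this comment, the sketch below shows one way a model-provided collate function could be applied to a user-supplied stream of samples. This is not Flash or IceVision API: detection_collate, batched, and sample_stream are hypothetical names, and strings stand in for image tensors.

```python
from itertools import islice


def detection_collate(samples):
    """Hypothetical model-provided collate: keeps per-image targets separate
    because each image can have a different number of boxes."""
    images, targets = zip(*samples)
    return list(images), list(targets)


def batched(iterable_dataset, batch_size, collate_fn):
    """Group a stream of samples into batches using the supplied collate_fn."""
    iterator = iter(iterable_dataset)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield collate_fn(chunk)


def sample_stream():
    # Toy stream of (image, target) pairs.
    yield "img0", {"boxes": [(0, 0, 5, 5)]}
    yield "img1", {"boxes": []}
    yield "img2", {"boxes": [(1, 1, 2, 2), (3, 3, 4, 4)]}


batches = list(batched(sample_stream(), batch_size=2, collate_fn=detection_collate))
```

Under this arrangement the data side never needs random access or a length; the last batch is simply shorter when the stream runs out.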

0 reactions
austinmw commented, Jun 29, 2022

Awesome news, can’t wait to see, thanks!
