Map-style vs Iterable-style Dataset
I want to start a discussion to determine the right way to go forward with data loading efficiency and flexibility for Lhotse PyTorch datasets.
The PyTorch documentation for the dataset API describes two types of datasets: map-style and iterable-style. A short summary:
Map-style:

```python
for index in sampler:
    sample = dataset[index]
```

- loads a single item at a time
- `DataLoader` samples indices of the items to be loaded (using a `Sampler`) based on `len()` of the dataset
- `DataLoader` specifies a fixed batch size and takes care of collation (possibly with a custom `collate_fn`)
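To make the map-style contract concrete, here is a minimal sketch of such a dataset driven by a `DataLoader`. The class name and toy data are illustrative, not part of Lhotse's actual API:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyMapDataset(Dataset):
    """Map-style: random access by index plus a known length."""
    def __init__(self, n_items: int):
        # Toy data: item i is a length-3 tensor filled with the value i.
        self.data = [torch.full((3,), float(i)) for i in range(n_items)]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]

dataset = ToyMapDataset(10)
# DataLoader samples indices with a Sampler (here: shuffled) and
# collates fixed-size batches via the default collate_fn.
loader = DataLoader(dataset, batch_size=4, shuffle=True)
for batch in loader:
    print(batch.shape)  # (4, 3) for the full batches, (2, 3) for the last
```

Note that the dataset itself never sees a batch; batching, shuffling, and collation all happen in the `DataLoader`.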
Iterable-style:

```python
for batch in iter(dataset):  # batch/sample depending on the implementation
    pass
```

- may load multiple items at the same time
- has to take care of collation, shuffling, batch size (can be dynamic), etc. itself
- does not have to specify `len()`
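As a sketch of the iterable-style contract, the dataset below yields ready-made batches whose size is dynamic (capped by a total-frame budget). The class and the budget-based batching rule are illustrative assumptions, not Lhotse's actual classes:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class ToyIterableDataset(IterableDataset):
    """Iterable-style: handles batching itself, with a dynamic batch size."""
    def __init__(self, lengths, max_frames_per_batch):
        self.lengths = lengths
        self.max_frames = max_frames_per_batch

    def __iter__(self):
        batch, frames = [], 0
        for length in self.lengths:
            if batch and frames + length > self.max_frames:
                yield batch  # dynamic batch size: as many items as fit
                batch, frames = [], 0
            batch.append(torch.zeros(length))
            frames += length
        if batch:
            yield batch

dataset = ToyIterableDataset(lengths=[5, 7, 3, 9, 2], max_frames_per_batch=12)
# batch_size=None tells DataLoader not to re-batch or collate the
# batches the dataset already produced.
loader = DataLoader(dataset, batch_size=None)
for batch in loader:
    print([t.numel() for t in batch])
```

Passing `batch_size=None` disables the `DataLoader`'s automatic batching, which is what makes the dataset-side dynamic batching possible.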
Comments
Our current examples are based on the map-style paradigm, which offloads more things to the DataLoader default settings. I’m wondering if we should explore the iterable-style dataset more going forward. It seems like it might make it easier to work with sharded data, or support non-random-access reading (e.g. block-random sampling for batches). I think it’d be good to start collecting some insights, experiences and requirements to make sure the design is solid.
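To illustrate why iterable-style datasets pair well with sharded data, here is a hedged sketch of worker-aware shard assignment using `torch.utils.data.get_worker_info()`. The shard contents are made up; in practice each "shard" would be a sequentially-readable file (e.g. a tar archive or a manifest), not an in-memory list:

```python
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ShardedIterableDataset(IterableDataset):
    def __init__(self, shards):
        self.shards = shards  # illustrative: lists standing in for shard files

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            my_shards = self.shards  # single-process loading
        else:
            # Round-robin assignment: worker i reads shards i, i+n, i+2n, ...
            my_shards = self.shards[info.id::info.num_workers]
        for shard in my_shards:
            # In reality this would read items sequentially from the shard.
            yield from shard

shards = [[0, 1], [2, 3], [4, 5], [6, 7]]
loader = DataLoader(ShardedIterableDataset(shards), batch_size=None, num_workers=2)
print(sorted(int(x) for x in loader))  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

Without the `get_worker_info()` split, every worker would iterate the full shard list and each item would be duplicated `num_workers` times, which is the classic pitfall of `IterableDataset` with multi-process loading.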
Another question is whether Lhotse should concern itself with this at all, and if so, to what extent. I don't think we'll be able to provide an optimal solution for every use case, but I think it'd be good to provide an option that (perhaps with some tuning) is at least okay in most compute infrastructures. We'll also gain more insights regarding that from building and running K2 recipes.
Issue Analytics
- Created 3 years ago
- Comments: 6
Top GitHub Comments
Mm. Some datasets will have drastically different lengths of utterance. If we don’t at least try to do something smart about this (e.g. combine multiple short utterances into a single utterance, with silence in between so the model can know they’re separate), it will be very inefficient.
On Thu, Nov 5, 2020 at 10:17 PM Piotr Żelasko notifications@github.com wrote:
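One common way to reduce the padding waste the comment describes is to bucket utterances by duration before batching, so each batch holds similarly-sized items and long utterances get smaller batches. A minimal sketch with made-up durations (this is a greedy toy, not Lhotse's actual sampler):

```python
def duration_buckets(durations, batch_seconds):
    """Sort utterances by duration, then greedily cut batches by a
    total-duration budget. Returns lists of utterance indices."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches, batch, total = [], [], 0.0
    for i in order:
        if batch and total + durations[i] > batch_seconds:
            batches.append(batch)
            batch, total = [], 0.0
        batch.append(i)
        total += durations[i]
    if batch:
        batches.append(batch)
    return batches

# Short utterances pack densely together; long ones end up in small batches.
print(duration_buckets([9.0, 1.0, 2.0, 8.0, 1.5], batch_seconds=10.0))
```

This is essentially what the cut-aware samplers mentioned in the closing comment do in a more principled way (with shuffling within buckets, duration quotas, etc.).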
I think after the latest refactoring of all datasets as map-style datasets with cut-aware samplers (#202), this issue can be closed.