Map-style vs Iterable-style Dataset
I want to start a discussion to determine the right way to go forward with data loading efficiency and flexibility for Lhotse PyTorch datasets.
The PyTorch documentation for the dataset API describes two types of datasets: map-style and iterable-style. A short summary:
Map-style:

```python
for index in sampler:
    sample = dataset[index]
```

- loads a single item at a time
- `DataLoader` samples indices of the items to be loaded (using a `Sampler`) based on `len()` of the dataset
- `DataLoader` specifies a fixed batch size and takes care of collation (possibly with a custom `collate_fn`)
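To make the map-style contract concrete, here is a minimal sketch of such a dataset driven by a `DataLoader`. The class name and toy data are illustrative, not part of Lhotse's actual API:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyMapDataset(Dataset):
    """Map-style: random access by index plus a known length."""
    def __init__(self, n_items: int):
        # Toy data: item i is a length-3 tensor filled with the value i.
        self.data = [torch.full((3,), float(i)) for i in range(n_items)]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]

dataset = ToyMapDataset(10)
# DataLoader samples indices with a Sampler (here: shuffled) and
# collates fixed-size batches via the default collate_fn.
loader = DataLoader(dataset, batch_size=4, shuffle=True)
for batch in loader:
    print(batch.shape)  # (4, 3) for the full batches, (2, 3) for the last
```

Note that the dataset itself never sees a batch; batching, shuffling, and collation all happen in the `DataLoader`.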
Iterable-style:

```python
for batch in iter(dataset):  # batch/sample depending on the implementation
    pass
```

- may load multiple items at the same time
- has to take care of collation, shuffling, batch size (can be dynamic), etc. itself
- does not have to specify `len()`
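As a sketch of the iterable-style contract, the dataset below yields ready-made batches whose size is dynamic (capped by a total-frame budget). The class and the budget-based batching rule are illustrative assumptions, not Lhotse's actual classes:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class ToyIterableDataset(IterableDataset):
    """Iterable-style: handles batching itself, with a dynamic batch size."""
    def __init__(self, lengths, max_frames_per_batch):
        self.lengths = lengths
        self.max_frames = max_frames_per_batch

    def __iter__(self):
        batch, frames = [], 0
        for length in self.lengths:
            if batch and frames + length > self.max_frames:
                yield batch  # dynamic batch size: as many items as fit
                batch, frames = [], 0
            batch.append(torch.zeros(length))
            frames += length
        if batch:
            yield batch

dataset = ToyIterableDataset(lengths=[5, 7, 3, 9, 2], max_frames_per_batch=12)
# batch_size=None tells DataLoader not to re-batch or collate the
# batches the dataset already produced.
loader = DataLoader(dataset, batch_size=None)
for batch in loader:
    print([t.numel() for t in batch])
```

Passing `batch_size=None` disables the `DataLoader`'s automatic batching, which is what makes the dataset-side dynamic batching possible.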
Comments
Our current examples are based on the map-style paradigm, which offloads more things to the DataLoader default settings. I’m wondering if we should explore the iterable-style dataset more going forward. It seems like it might make it easier to work with sharded data, or support non-random-access reading (e.g. block-random sampling for batches). I think it’d be good to start collecting some insights, experiences and requirements to make sure the design is solid.
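To illustrate why iterable-style datasets pair well with sharded data, here is a hedged sketch of worker-aware shard assignment using `torch.utils.data.get_worker_info()`. The shard contents are made up; in practice each "shard" would be a sequentially-readable file (e.g. a tar archive or a manifest), not an in-memory list:

```python
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ShardedIterableDataset(IterableDataset):
    def __init__(self, shards):
        self.shards = shards  # illustrative: lists standing in for shard files

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            my_shards = self.shards  # single-process loading
        else:
            # Round-robin assignment: worker i reads shards i, i+n, i+2n, ...
            my_shards = self.shards[info.id::info.num_workers]
        for shard in my_shards:
            # In reality this would read items sequentially from the shard.
            yield from shard

shards = [[0, 1], [2, 3], [4, 5], [6, 7]]
loader = DataLoader(ShardedIterableDataset(shards), batch_size=None, num_workers=2)
print(sorted(int(x) for x in loader))  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

Without the `get_worker_info()` split, every worker would iterate the full shard list and each item would be duplicated `num_workers` times, which is the classic pitfall of `IterableDataset` with multi-process loading.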
Another question is whether Lhotse should concern itself with this at all, and if so, to what extent. I don't think we'll be able to provide an optimal solution for every use case, but I think it'd be good to provide an option that (perhaps with some tuning) is at least okay in most compute infrastructures. We'll also gain more insights regarding that from building and running K2 recipes.
Issue Analytics
- Created 3 years ago
- Comments: 6
Top GitHub Comments
Mm. Some datasets will have drastically different lengths of utterance. If we don’t at least try to do something smart about this (e.g. combine multiple short utterances into a single utterance, with silence in between so the model can know they’re separate), it will be very inefficient.
On Thu, Nov 5, 2020 at 10:17 PM Piotr Żelasko notifications@github.com wrote:
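One common way to reduce the padding waste the comment describes is to bucket utterances by duration before batching, so each batch holds similarly-sized items and long utterances get smaller batches. A minimal sketch with made-up durations (this is a greedy toy, not Lhotse's actual sampler):

```python
def duration_buckets(durations, batch_seconds):
    """Sort utterances by duration, then greedily cut batches by a
    total-duration budget. Returns lists of utterance indices."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches, batch, total = [], [], 0.0
    for i in order:
        if batch and total + durations[i] > batch_seconds:
            batches.append(batch)
            batch, total = [], 0.0
        batch.append(i)
        total += durations[i]
    if batch:
        batches.append(batch)
    return batches

# Short utterances pack densely together; long ones end up in small batches.
print(duration_buckets([9.0, 1.0, 2.0, 8.0, 1.5], batch_seconds=10.0))
```

This is essentially what the cut-aware samplers mentioned in the closing comment do in a more principled way (with shuffling within buckets, duration quotas, etc.).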
I think after the latest refactoring of all datasets as map-style datasets with cut-aware samplers (#202), this issue can be closed.