
Draft specification of a new, extensible sequential storage format

See original GitHub issue

Motivation

Currently Lhotse offers two types of data storage:

  • random-access (HDF5, lilcom-chunky, individual files, etc.), with manifests that are fast to read and easy to extend with new fields;
  • sequential (tarfiles with WebDataset), with manifests stored together with the data, which maximizes I/O throughput but makes metadata modification painful (basically, each modification requires a new copy of the whole thing).

I’d like to have a format that is both efficient to read (comparable to WebDataset tarfile efficiency) and easy to modify the metadata (like editing a JSONL file is) or extend with new pieces of metadata or custom data fields (such as other types of features, quantized codebook indexes, etc.).

Implementation

Core idea

My idea is to keep the sequential storage, but store the manifest and the data in separate files: e.g., cuts.jsonl.gz for the manifest and cuts.tar for audio/features/etc. At runtime, we read both of them in sync and populate the cuts with in-memory raw data (e.g. compressed lilcom binary blobs for features, encoded audio bytes, etc.), which is decoded into numpy arrays only when we call .load_X on the cut. Lhotse would need to offer an API that reads both of these files (or a directory with a standardized file layout) to make sure iteration over metadata and data is kept in sync.

The above expressed in pseudo-code:

# variant 1:
cuts = CutSet.from_new_format("cuts.jsonl.gz", data="cuts.tar")

# variant 2:
# Expected contents of my_data_dir:
# $ ls my_data_dir/
# cuts.jsonl.gz cuts.tar
cuts = CutSet.from_new_format("my_data_dir/")

# The underlying CutSet reads manifest lines and the corresponding data sequentially in sync from the two files
# The following conditions are the same as in the WebDataset-based approach:
# - the data discarded by .filter is read into memory anyway
# - the shuffle operation keeps a buffer in memory that contains the actual data in addition to the metadata
for cut in cuts.filter(lambda c: c.duration >= 5).shuffle().subset(first=1000):
    audio = cut.load_audio()  # the data was already in memory, but it's an implementation detail
    ...
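The synchronized reading described above can be sketched with stdlib tools alone; the function name and file layout below are hypothetical, not actual Lhotse API:

```python
import gzip
import json
import tarfile


def iter_cuts_with_data(manifest_path: str, tar_path: str):
    """Yield (metadata, raw_bytes) pairs by reading the JSONL manifest
    and the tar archive in lockstep. Hypothetical sketch, not Lhotse API."""
    with gzip.open(manifest_path, "rt") as manifest, tarfile.open(tar_path) as tar:
        for line in manifest:
            meta = json.loads(line)
            member = tar.next()  # sequential read: next archive entry
            assert member is not None and member.name.startswith(meta["id"]), \
                "manifest and tar archive went out of sync"
            yield meta, tar.extractfile(member).read()
```

The assertion is the cheap safety net for the "kept in sync" requirement: tar entries must appear in exactly the manifest's order.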

Modularity extension

The idea above can be extended to support multiple sequentially read tarfiles with data, each for a single data field type, e.g.:

$ ls my_data_dir/
cuts.jsonl.gz audio.tar features.tar codebook_indexes.tar ...

This is very convenient because in order to extend the sequentially stored data + metadata with a new field, we only need to add a new tar file that contains the data, and produce a new JSONL manifest that refers to that field. I expect there will be some cost to pay in terms of I/O efficiency, but I can’t really tell without testing if this is going to be a 10%, 50%, or 90% reading speed penalty.

The modular variant can also implement an API that will limit the field types being pulled into memory (or, effectively erase these fields from cuts) when the user needs only a specific set of data fields to optimize the I/O usage.
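The modular variant and the field-selection API could be sketched together; all names here are hypothetical, and the key point is that archives for unrequested fields are never even opened, so they cost no I/O:

```python
import contextlib
import gzip
import json
import tarfile


def iter_cuts_modular(manifest_path, field_tars, fields=None):
    """Read the JSONL manifest and one tar archive per data field in
    lockstep; `field_tars` maps field name -> tar path. If `fields` is
    given, the other archives are never opened, so their data is never
    pulled into memory. Hypothetical sketch, not Lhotse API."""
    if fields is not None:
        field_tars = {f: p for f, p in field_tars.items() if f in fields}
    with contextlib.ExitStack() as stack:
        manifest = stack.enter_context(gzip.open(manifest_path, "rt"))
        tars = {
            f: stack.enter_context(tarfile.open(p)) for f, p in field_tars.items()
        }
        for line in manifest:
            meta = json.loads(line)
            blobs = {f: t.extractfile(t.next()).read() for f, t in tars.items()}
            yield meta, blobs
```

Adding a new field then really is just dropping a new tar file into the directory plus a manifest that references it.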

Other considerations

I am considering that maybe we don’t actually need the WebDataset dependency to handle all of that – we only rely on a small subset of its features related to storage: reading/writing tarfiles, opening pipes to subprocesses, and de-duplication of data across dataloader subprocesses. I think all of those could be ported to Lhotse for greater control over the low-level details.
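That subset of features is small enough to sketch. For instance, a "pipe:" URI protocol and per-worker shard de-duplication, two of the WebDataset behaviors mentioned above, could look roughly like this (illustrative only, not a drop-in port):

```python
import shlex
import subprocess


def open_maybe_pipe(uri: str):
    """Open a local file, or a 'pipe:<command>' URI by spawning the
    command and reading its stdout -- the same idea as WebDataset's
    pipe: protocol. Sketch only."""
    if uri.startswith("pipe:"):
        proc = subprocess.Popen(
            shlex.split(uri[len("pipe:"):]), stdout=subprocess.PIPE
        )
        return proc.stdout
    return open(uri, "rb")


def shards_for_worker(shards, worker_id: int, num_workers: int):
    """Give each dataloader worker a disjoint slice of the shard list,
    so no two workers read (and then discard) the same data."""
    return shards[worker_id::num_workers]
```

In a PyTorch IterableDataset, `worker_id`/`num_workers` would come from `torch.utils.data.get_worker_info()`.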


I’d love to get some thoughts on these ideas. @danpovey @csukuangfj @janvainer @Tomiinek

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 6

Top GitHub Comments

3 reactions
danpovey commented, Oct 10, 2022

For very large scale data I think it’s reasonable to limit ourselves to sequential access (and only local shuffling) and it seems to me that WebDataset would handle the I/O, and the aggregation of same-named items from different files, quite well.

A typical scenario is that someone might want to train on 50% gigaspeech, 40% librispeech, 10% switchboard (assuming they have found a way to make the transcripts compatible), and I’m thinking that each of those datasets might have been written to disk as a bunch of xxx.tar.gz files, each one containing a small part of the dataset. And perhaps different .tar.gz files for different components, e.g. lilcom features vs. audio vs. codebooks. [I’d be open to just putting the lines of the manifest into a .tar.gz, as if each one had been in its own file, so that WebDataset could handle the aggregation of that at the same time as it handles, say, features vs. codebooks.]
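The 50/40/10 mixing could be realized as weighted sampling over several (cycled) shard streams; the helper below is purely illustrative, with made-up names:

```python
import random


def mix_streams(streams, weights, seed=0):
    """Interleave several infinite cut streams, picking the next source
    at random according to the mixing weights (e.g. 0.5 gigaspeech,
    0.4 librispeech, 0.1 switchboard). Illustrative sketch only."""
    rng = random.Random(seed)
    streams = [iter(s) for s in streams]
    while True:
        (source,) = rng.choices(streams, weights=weights)
        yield next(source)
```

Per-source exhaustion handling (restart the stream, or drop it from the pool) is a design choice this sketch leaves open.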

So what I’m thinking is we’d provide a way to do the mixing of data sources, and to define what an epoch is in this scenario (if we’re to have such a notion), and also a way of saving the current state to disk, like which files we were in the middle of reading and how far through, so that we can resume when a training process needs to continue.
[… and also a way to gracefully handle situations when one of the files is unavailable or corrupted.]
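Saving and resuming could amount to a small state dict per stream recording which shard we are in and how far through it; a sketch, assuming shards can be re-opened and fast-forwarded (class and method names are hypothetical):

```python
class ResumableShardReader:
    """Iterate over a list of shards and expose the current position
    (which shard, how many items consumed) so a training job can stop
    and later resume mid-epoch. Hypothetical sketch, not Lhotse API."""

    def __init__(self, shards, read_shard):
        self.shards = list(shards)
        self.read_shard = read_shard  # callable: shard -> iterable of items
        self.shard_idx = 0
        self.item_idx = 0

    def state_dict(self):
        return {"shard_idx": self.shard_idx, "item_idx": self.item_idx}

    def load_state_dict(self, state):
        self.shard_idx = state["shard_idx"]
        self.item_idx = state["item_idx"]

    def __iter__(self):
        while self.shard_idx < len(self.shards):
            items = list(self.read_shard(self.shards[self.shard_idx]))
            # Skip items already consumed in this shard (0 unless resuming).
            while self.item_idx < len(items):
                item = items[self.item_idx]
                self.item_idx += 1
                yield item
            self.shard_idx += 1
            self.item_idx = 0
```

Graceful handling of a missing or corrupted shard would slot in naturally here: catch the error around `read_shard` and advance `shard_idx`.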

For augmentation we might want to be reading Musan data in randomly too, I suppose in principle we could just cycle through that rather than doing it as random-access. We can just randomize the order of everything at the point when we are dumping it to the tar files, to make it unnecessary to do long-range randomization when we do data loading.

Rather than having multiple dataloaders read the same data and discard some of it, it seems to me that different dataloader processes could just handle different tar.gz files, if there are multiple such files; that would be more efficient. (I don’t know whether that would easily mesh with PyTorch’s concept of dataloaders, though.) E.g. if we are to cycle through a dataset that’s dumped in .tar.gz files called a, b, c, and there are 2 dataloaders, we could view it as alternating in a stream of files that repeats, like a b c a b c, so process 1 takes a c b a c b… and process 2 takes b a c b a c. Now, if the files are different sizes, and num-loaders and num-files are not coprime, one dataloader could “get ahead” of the other, so that some data would end up being seen more often than other data. But that could in principle be handled by simply ensuring, during initial data processing, that each archive within a dataset has about the same number of files with similar properties. For Librispeech we could mix in train-clean-100 with train-other-500, for example, by specifying that we sample the train-clean-100 data 5 times less frequently than train-other-500, so that we cycle through them at about the same rate. [This treats them as different datasets, since they have different properties and I’m assuming we prepared them separately and did not co-mix them into tar files.] Within each dataset, though, say train-clean-100, we would have to ensure that the files are “properly mixed”.
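The alternating assignment described here (for shards a b c and 2 loaders: one process takes a c b a c b…, the other b a c b a c…) is just a strided walk over the endlessly repeating shard stream; a minimal sketch with a made-up name:

```python
from itertools import count


def loader_shard_stream(shards, loader_id: int, num_loaders: int):
    """View the shard list as an endlessly repeating stream
    (a b c a b c ...) and give each dataloader every num_loaders-th
    element of it, starting at its own offset. Sketch of the scheme
    described in the comment above."""
    for step in count(loader_id, num_loaders):
        yield shards[step % len(shards)]
```

When len(shards) and num_loaders are not coprime, each loader just cycles its own fixed subset, which is the skew the comment warns about.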

Incidentally, the concept of an “epoch” is a little tricky when mixing multiple datasets. If epoch is defined by gigaspeech, for instance, on epoch 2 we start from the start of gigaspeech, but we may be starting from the middle of the other data sources; and the only way to know where to start is to load the state of the data-loader. But that’s OK.

I don’t know whether there might be some opportunity for simplification or introducing simpler abstractions. I think all this cut/recording/supervision stuff can potentially get confusing. And if we’re to be dealing with streams of these items (e.g. streams of Cuts) then it’s not clear to me that the concept of CutSet is so useful any more? And we could just think about things that operate on cuts rather than sets of cuts? Just a thought. I’m thinking that maybe all of this recording/cut/supervision stuff might be used more during initial processing of the dataset, and when we dump the supervised segments to disk we might be able to forget about the recording stuff? Or put any info we need into the Cut directly to avoid having to sync with a manifest of Recording stuff that may not have convenient sequential access and may not really even be needed? These are quite vague thoughts as I am not familiar with all details of the current implementation.

0 reactions
danpovey commented, Oct 13, 2022

Regarding someone calling item.features: I wasn’t imagining that this would happen in the training loop. I was thinking it would happen in the dataloader (we convert to a plain dict before sending to the training process), called by some user-specified function for, say, processing items prior to combining them into batches.
Yes, IDK how one might organize the shard-queue thing. One possibility is to have a background thread in the training process open a port for that? Gets ugly, though, obviously.
