Support for sharded datasets (e.g. the WebDataset API) for efficiently working on network-attached filesystems
First of all: thanks for making such an amazing tool public!
I was interested in using the VoxCeleb speaker recognition recipe as a baseline/starting point for a research project. However, a major issue is how the data is stored on disk. If I understand correctly, SpeechBrain follows the pattern of a CSV/JSON manifest in which each training sample has a path to a file (e.g. an audio file) on a filesystem mounted on the machine.
In the case of VoxCeleb1 and 2, this amounts to ~1,200,000 files in a single folder. On a local NVMe SSD this is fine, but our cluster uses a network-attached filesystem to provide most of its storage space. This has severe performance implications: loading ~100,000 audio files (a single epoch of VoxCeleb1) over a network-attached filesystem takes an hour. There is a fairly straightforward solution to this problem: storing your data in so-called shards (e.g. TFRecords in TensorFlow). In PyTorch there is a current proposal [1] to include the WebDataset [2, 3] library as an API for sharded datasets. The basic premise of this library is storing your data in multiple `.tar` or `.tar.gz` files. In my case, this reduced the loading time of a single VoxCeleb1 epoch from an hour to 3 minutes.
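For concreteness, here is a minimal sketch of the shard idea, assuming the `webdataset` package is installed; the manifest structure, shard names, and shard count are hypothetical:

```python
import webdataset as wds  # assumption: the webdataset package is installed

# Hypothetical manifest: one entry per utterance.
manifest = [
    {"id": "id10001-abc-00001", "wav": "wav/id10001/abc/00001.wav", "spk": "id10001"},
]

# Pack samples into fixed-size .tar shards instead of millions of small files.
with wds.ShardWriter("shards/voxceleb1-%06d.tar", maxcount=5000) as sink:
    for entry in manifest:
        with open(entry["wav"], "rb") as f:
            audio_bytes = f.read()
        sink.write({
            "__key__": entry["id"],   # unique key per sample (no dots)
            "wav": audio_bytes,       # stored as <key>.wav inside the shard
            "spk.txt": entry["spk"],  # label stored as <key>.spk.txt
        })

# Stream the shards back: one sequential read per shard instead of one
# network round-trip per file.
dataset = wds.WebDataset("shards/voxceleb1-{000000..000239}.tar")
for sample in dataset:
    audio_bytes = sample["wav"]   # raw bytes; decode with e.g. torchaudio
    speaker = sample["spk.txt"]   # label bytes
    break
```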
I wanted to discuss a potential integration of WebDataset in SpeechBrain. Currently, `speechbrain.dataio.datasets.DynamicItemDataset` inherits from `torch.utils.data.Dataset`, while the `WebDataset` class uses the `torch.utils.data.IterableDataset` API. I am not yet familiar enough with the code, so I had the following questions:
- Is there an existing solution for efficiently using SpeechBrain on network-attached file systems?
- If not, would there be interest in merging a PR offering support for `WebDataset`?
- Any suggestions for the best way to integrate `WebDataset`?
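To illustrate the API difference in question, here is a minimal sketch contrasting a map-style dataset (the pattern `DynamicItemDataset` builds on) with a WebDataset-style iterable stream; the shard pattern, stand-in class, and decode function are hypothetical:

```python
import io
import torch
import torchaudio          # assumption: torchaudio is available for decoding
import webdataset as wds   # assumption: the webdataset package is installed

# Map-style dataset: random access by index, so index-based shuffling and
# DistributedSampler work out of the box.
class ManifestDataset(torch.utils.data.Dataset):  # hypothetical stand-in
    def __init__(self, manifest):
        self.manifest = manifest  # list of dicts with e.g. a "wav" path
    def __len__(self):
        return len(self.manifest)
    def __getitem__(self, idx):
        return self.manifest[idx]

# Iterable-style dataset: samples stream out of the tar shards in storage
# order; there is no __getitem__, no len(), and no index-based sampler.
def decode_sample(sample):
    # Each group of tar members becomes a dict keyed by file extension.
    wav, sr = torchaudio.load(io.BytesIO(sample["wav"]))
    return wav, sample["__key__"]

shard_pattern = "shards/voxceleb1-{000000..000239}.tar"  # hypothetical shards
stream = wds.WebDataset(shard_pattern).shuffle(1000).map(decode_sample)

# Both plug into DataLoader, but shuffling, sharding across workers, and
# epoch boundaries behave differently for the iterable case.
loader = torch.utils.data.DataLoader(stream, batch_size=None, num_workers=4)
```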
Links:
[1] https://github.com/pytorch/pytorch/issues/38419
[2] https://github.com/tmbdev/webdataset
[3] https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/
Top GitHub Comments
Thanks.
WebDataset just delivers training samples sequentially, and it does so exactly throughout its entire pipeline. The main difficulty is that PyTorch’s DDP requires an exactly equal number of batches in all workers right now, and that’s a limitation you need to work around, no matter what loader you use. This isn’t something the loader can fix, since samples, batches, and nodes just don’t divide evenly in general.
You can get as close to a solution as indexed datasets by using Tensorcom; with Tensorcom, all the batches are aggregated from all the loaders before being redistributed to the GPUs. But you're still left with a partial set of batches at the end of the epoch, just like with indexed datasets.
A full solution necessarily requires some kind of inter-node communication. One idea is to train with zero-length batches on the nodes that have run out of data, and end the epoch once DDP signals that all nodes have run out of data. It would be a nice little project outside WebDataset and would benefit all loaders.
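As a rough illustration of the zero-length-batch idea (not something WebDataset or SpeechBrain provides; `compute_loss` and the overall loop are hypothetical), each rank could keep stepping with a zeroed-out loss until every rank agrees it is out of data:

```python
import torch
import torch.distributed as dist

def run_epoch(loader, ddp_model, optimizer, compute_loss, device):
    """Sketch: keep all ranks stepping until every rank has exhausted its data."""
    data_iter = iter(loader)
    last_batch = None
    while True:
        try:
            batch = next(data_iter)
            last_batch = batch
            exhausted = torch.tensor(0, device=device)
        except StopIteration:
            batch = None
            exhausted = torch.tensor(1, device=device)

        # End the epoch only once *all* ranks report that they are out of data.
        dist.all_reduce(exhausted, op=dist.ReduceOp.MIN)
        if exhausted.item() == 1:
            break

        if batch is not None:
            loss = compute_loss(ddp_model, batch)
        else:
            # This rank is out of data: repeat the last batch with a zeroed loss
            # so DDP's gradient all-reduce still lines up across ranks.
            loss = compute_loss(ddp_model, last_batch) * 0.0

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Newer PyTorch releases also ship a built-in mechanism for uneven inputs, the `DistributedDataParallel.join()` context manager, which addresses essentially the same problem.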
I’ve never really understood the purpose of going through that trouble though. The only reason for one-sample-per-epoch has been the relationship between full gradient descent and SGD, but training doesn’t approximate full gradient descent anyway (not even close). Furthermore, if you drop the one-sample-per-epoch requirement, you’re basically just training with sampling-with-replacement, which is statistically a perfectly good approach (arguably better than using each sample exactly once). I generally prefer “# samples seen” as a consistent and easy-to-compare metric in all experiments.
(Note that the two issues, namely determinism and exact epochs, are separate.)
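As a small sketch of what training by “# samples seen” with sampling-with-replacement can look like (the manifest and budget below are placeholders):

```python
import itertools
import random

# Placeholder "manifest"; in practice this would be the list of samples or shards.
manifest = [f"sample_{i:04d}" for i in range(10)]

def resampled_stream(items, seed=0):
    # Sampling with replacement: an endless stream, so there is no epoch
    # boundary to synchronize across workers.
    rng = random.Random(seed)
    while True:
        yield rng.choice(items)

budget = 100  # hypothetical budget of samples to see
for n_seen, sample in enumerate(itertools.islice(resampled_stream(manifest), budget), start=1):
    pass  # the training step on `sample` would go here
```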
Just FYI: I’m working on a sharded version (with WebDataset) of the VoxCeleb recipe here: https://github.com/nikvaessen/speechbrain/tree/sharded-voxceleb/my-recipes/SpeakerRec
I’ll give an update when it’s in a usable state 😃