Support for sharded datasets (e.g. the WebDataset API) for efficiently working on network-attached filesystems
First of all: thanks for making such an amazing tool public!
I was interested in using the VoxCeleb speaker recognition recipe as a baseline/starting point for a research project. However, a major issue is how the data is stored on disk. If I understand correctly, SpeechBrain follows the pattern of a CSV/JSON manifest in which each training sample has a path to a file (e.g. an audio file) on a filesystem mounted on the machine.
In the case of VoxCeleb1 and 2, this amounts to ~1,200,000 files in a single folder. On a local NVMe SSD this is fine, but our cluster uses a network-attached filesystem to provide most of its storage space. This has severe performance implications: loading ~100,000 audio files (a single epoch of VoxCeleb1) over a network-attached filesystem takes an hour. There is a fairly straightforward solution to this problem: storing your data in so-called shards (e.g. TFRecords in TensorFlow). In PyTorch there is a current proposal [1] to include the WebDataset [2, 3] library as an API for sharded datasets. The basic premise of this library is storing your data in multiple `.tar` or `.tar.gz` files. In my case, this reduced the loading time of a single VoxCeleb1 epoch from an hour to 3 minutes.
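For concreteness, here is a minimal sketch of the shard idea, assuming the `webdataset` package is installed; the manifest structure, shard names, and shard count are hypothetical:

```python
import webdataset as wds  # assumption: the webdataset package is installed

# Hypothetical manifest: one entry per utterance.
manifest = [
    {"id": "id10001-abc-00001", "wav": "wav/id10001/abc/00001.wav", "spk": "id10001"},
]

# Pack samples into fixed-size .tar shards instead of millions of small files.
with wds.ShardWriter("shards/voxceleb1-%06d.tar", maxcount=5000) as sink:
    for entry in manifest:
        with open(entry["wav"], "rb") as f:
            audio_bytes = f.read()
        sink.write({
            "__key__": entry["id"],   # unique key per sample (no dots)
            "wav": audio_bytes,       # stored as <key>.wav inside the shard
            "spk.txt": entry["spk"],  # label stored as <key>.spk.txt
        })

# Stream the shards back: one sequential read per shard instead of one
# network round-trip per file.
dataset = wds.WebDataset("shards/voxceleb1-{000000..000239}.tar")
for sample in dataset:
    audio_bytes = sample["wav"]   # raw bytes; decode with e.g. torchaudio
    speaker = sample["spk.txt"]   # label bytes
    break
```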
I wanted to discuss a potential integration of WebDataset in SpeechBrain. Currently, `speechbrain.dataio.datasets.DynamicItemDataset` inherits from `torch.utils.data.Dataset`, while the `WebDataset` class uses the `torch.utils.data.IterableDataset` API. I am not yet familiar enough with the code, so I had the following questions:
- Is there an existing solution for efficiently using SpeechBrain on network-attached file systems?
- If not, would there be interest in merging a PR offering support for `WebDataset`?
- Any suggestions for the best way to integrate `WebDataset`?
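To illustrate the API difference in question, here is a minimal sketch contrasting a map-style dataset (the pattern `DynamicItemDataset` builds on) with a WebDataset-style iterable stream; the shard pattern, stand-in class, and decode function are hypothetical:

```python
import io
import torch
import torchaudio          # assumption: torchaudio is available for decoding
import webdataset as wds   # assumption: the webdataset package is installed

# Map-style dataset: random access by index, so index-based shuffling and
# DistributedSampler work out of the box.
class ManifestDataset(torch.utils.data.Dataset):  # hypothetical stand-in
    def __init__(self, manifest):
        self.manifest = manifest  # list of dicts with e.g. a "wav" path
    def __len__(self):
        return len(self.manifest)
    def __getitem__(self, idx):
        return self.manifest[idx]

# Iterable-style dataset: samples stream out of the tar shards in storage
# order; there is no __getitem__, no len(), and no index-based sampler.
def decode_sample(sample):
    # Each group of tar members becomes a dict keyed by file extension.
    wav, sr = torchaudio.load(io.BytesIO(sample["wav"]))
    return wav, sample["__key__"]

shard_pattern = "shards/voxceleb1-{000000..000239}.tar"  # hypothetical shards
stream = wds.WebDataset(shard_pattern).shuffle(1000).map(decode_sample)

# Both plug into DataLoader, but shuffling, sharding across workers, and
# epoch boundaries behave differently for the iterable case.
loader = torch.utils.data.DataLoader(stream, batch_size=None, num_workers=4)
```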
Links:
[1] https://github.com/pytorch/pytorch/issues/38419
[2] https://github.com/tmbdev/webdataset
[3] https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/
Top GitHub Comments
Thanks.
WebDataset just delivers training samples sequentially, and it does so exactly throughout its entire pipeline. The main difficulty is that PyTorch’s DDP requires an exactly equal number of batches in all workers right now, and that’s a limitation you need to work around, no matter what loader you use. This isn’t something the loader can fix, since samples, batches, and nodes just don’t divide evenly in general.
You can get as close to a solution as indexed datasets by using Tensorcom; with Tensorcom, all the batches are aggregated from all the loaders before being redistributed to the GPUs. But you're still left with a partial set of batches at the end of the epoch, just like with indexed datasets.
A full solution necessarily requires some kind of inter-node communication. One idea is to train with zero-length batches on the nodes that have run out of data, and end the epoch once DDP signals that all nodes have run out of data. It would be a nice little project outside WebDataset and would benefit all loaders.
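As a rough illustration of the zero-length-batch idea (not something WebDataset or SpeechBrain provides; `compute_loss` and the overall loop are hypothetical), each rank could keep stepping with a zeroed-out loss until every rank agrees it is out of data:

```python
import torch
import torch.distributed as dist

def run_epoch(loader, ddp_model, optimizer, compute_loss, device):
    """Sketch: keep all ranks stepping until every rank has exhausted its data."""
    data_iter = iter(loader)
    last_batch = None
    while True:
        try:
            batch = next(data_iter)
            last_batch = batch
            exhausted = torch.tensor(0, device=device)
        except StopIteration:
            batch = None
            exhausted = torch.tensor(1, device=device)

        # End the epoch only once *all* ranks report that they are out of data.
        dist.all_reduce(exhausted, op=dist.ReduceOp.MIN)
        if exhausted.item() == 1:
            break

        if batch is not None:
            loss = compute_loss(ddp_model, batch)
        else:
            # This rank is out of data: repeat the last batch with a zeroed loss
            # so DDP's gradient all-reduce still lines up across ranks.
            loss = compute_loss(ddp_model, last_batch) * 0.0

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Newer PyTorch releases also ship a built-in mechanism for uneven inputs, the `DistributedDataParallel.join()` context manager, which addresses essentially the same problem.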
I’ve never really understood the purpose of going through that trouble though. The only reason for one-sample-per-epoch has been the relationship between full gradient descent and SGD, but training doesn’t approximate full gradient descent anyway (not even close). Furthermore, if you drop the one-sample-per-epoch requirement, you’re basically just training with sampling-with-replacement, which is statistically a perfectly good approach (arguably better than using each sample exactly once). I generally prefer “# samples seen” as a consistent and easy-to-compare metric in all experiments.
(Note that the two issues, namely determinism and exact epochs, are separate.)
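As a small sketch of what training by “# samples seen” with sampling-with-replacement can look like (the manifest and budget below are placeholders):

```python
import itertools
import random

# Placeholder "manifest"; in practice this would be the list of samples or shards.
manifest = [f"sample_{i:04d}" for i in range(10)]

def resampled_stream(items, seed=0):
    # Sampling with replacement: an endless stream, so there is no epoch
    # boundary to synchronize across workers.
    rng = random.Random(seed)
    while True:
        yield rng.choice(items)

budget = 100  # hypothetical budget of samples to see
for n_seen, sample in enumerate(itertools.islice(resampled_stream(manifest), budget), start=1):
    pass  # the training step on `sample` would go here
```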
Just FYI: I’m working on a sharded version (with WebDataset) of the VoxCeleb recipe here: https://github.com/nikvaessen/speechbrain/tree/sharded-voxceleb/my-recipes/SpeakerRec
I’ll give an update when it’s in a usable state 😃