Improve tests for Dataset
See original GitHub issue
torchaudio has had only minimalistic tests for its dataset implementations (see here). We have recently improved our test utilities, and we can now generate synthetic data that emulates a subset of a dataset; see the YesNo and GTZAN examples.
We would like to do the same for the remaining datasets:
- VCTK
- LibriSpeech
- LJSpeech
- SpeechCommands
- CMUArctic
- CommonVoice
General Direction
- Check the dataset of interest and pick a subset of its files (check their naming conventions, sampling rate, and number of channels).
- Following the approach of the existing test modules, create a new test module `test/datasets/XXX_test.py` and define your test class.
- Generate the pseudo dataset in the `setUpClass` method, and create a list of expected data.
- Traverse the directory with the Dataset implementation.
- Check that the files are traversed in the expected order and that the loaded data match.
- Check that the Dataset traversed the expected number of files.
- If the dataset has multiple operational modes, like `subset` in GTZAN, also add these as test methods.
- Once the new test is added, remove the original test and the associated assets:
  - `test/assets/ARCTIC/cmu_us_aew_arctic/etc/txt.done.data`
  - `test/assets/ARCTIC/cmu_us_aew_arctic/wav/arctic_a0024.wav`
  - `test/assets/CommonVoice/cv-corpus-4-2019-12-10/tt/clips/common_voice_tt_00000000.wav`
  - `test/assets/CommonVoice/cv-corpus-4-2019-12-10/tt/train.tsv`
  - `test/assets/LJSpeech-1.1/metadata.csv`
  - `test/assets/LJSpeech-1.1/wavs/LJ001-0001.wav`
  - `test/assets/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac`
  - `test/assets/LibriSpeech/dev-clean/1272/128104/1272-128104.trans.txt`
  - `test/assets/SpeechCommands/speech_commands_v0.02/go/0a9f9af7_nohash_0.wav`
  - `test/assets/VCTK-Corpus/txt/p224/p224_002.txt`
  - `test/assets/VCTK-Corpus/wav48/p224/p224_002.wav`
- Once the PR is ready, add @mthrok as reviewer.
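The steps above can be sketched as a self-contained test module. This is an illustrative outline only: `_save_wav` is a hypothetical stand-in for torchaudio's shared test helper, and the sorted directory walk stands in for iterating over the actual `torchaudio.datasets` class.

```python
import os
import struct
import tempfile
import unittest
import wave


def _save_wav(path, samples, sample_rate):
    # Hypothetical stand-in for the shared save_wav test helper:
    # writes 16-bit mono PCM integer samples to `path`.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(samples), *samples))


class TestFakeDataset(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Generate a pseudo dataset mimicking the real layout, and
        # record the expected traversal order.
        cls._tmp = tempfile.TemporaryDirectory()
        cls.root_dir = cls._tmp.name
        cls.expected = []
        for i in range(3):
            rel_path = os.path.join("speaker0", "utterance_%04d.wav" % i)
            _save_wav(
                os.path.join(cls.root_dir, rel_path), [0, i, -i], sample_rate=16000)
            cls.expected.append(rel_path)

    @classmethod
    def tearDownClass(cls):
        cls._tmp.cleanup()

    def test_traversal(self):
        # A real test would iterate over the Dataset instance; here a
        # sorted os.walk stands in for that traversal.
        found = []
        for dirpath, _, filenames in sorted(os.walk(self.root_dir)):
            for name in sorted(filenames):
                found.append(
                    os.path.relpath(os.path.join(dirpath, name), self.root_dir))
        self.assertEqual(found, self.expected)            # order matches
        self.assertEqual(len(found), len(self.expected))  # count matches
```

A real module would replace the manual walk with the dataset class under test and also compare each loaded waveform and label against the expected data.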
Note
- It is highly recommended to use Anaconda.
- Please use a nightly build of PyTorch: https://pytorch.org/
- You can run the test with `pytest test/datasets/XXX_test.py`.
- PR example: #819
- For simplicity, please use the `wav` format when saving synthetic data (`save_wav`), even if the reference dataset uses another format. (Decoding formats like `mp3` adds complexity to the test logic, which we are trying to avoid.)
- When saving wave data with `save_wav`, the `dtype` of the Tensor makes a difference. If the reference dataset uses the WAV format, use the same bit depth (like `int16`). If the reference dataset uses a compressed format, like `mp3` or `flac`, use `float32` wav.
- Data loaded with a Dataset implementation typically has normalized values (in `[-1.0, 1.0]`) and `float32` type. (This is why `normalize_wav` is used to generate the reference data in the examples above.)
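The dtype note can be illustrated with a minimal sketch. The helpers below are hypothetical stand-ins for `save_wav` / `normalize_wav` (the real helpers live in torchaudio's test utilities and operate on Tensors):

```python
import os
import struct
import tempfile
import wave


def save_wav_int16(path, samples, sample_rate):
    # Hypothetical int16 writer: matches reference datasets that are
    # distributed as 16-bit WAV files.
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(samples), *samples))


def normalize_wav(samples):
    # Map int16 sample values into the normalized [-1.0, 1.0] range that
    # Dataset implementations return after loading.
    return [s / 32768.0 for s in samples]


raw = [0, 16384, -32768]  # int16 sample values
with tempfile.TemporaryDirectory() as tmp:
    save_wav_int16(os.path.join(tmp, "ref.wav"), raw, sample_rate=8000)
expected = normalize_wav(raw)  # [0.0, 0.5, -1.0]
```

Matching the bit depth of the reference data keeps the generated file byte-compatible with what the Dataset loader expects, while `normalize_wav` produces the reference values to compare against.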
Issue Analytics
- State:
- Created 3 years ago
- Comments: 12 (12 by maintainers)
Top GitHub Comments
I’ll take a look at CommonVoice dataset.
I’ll have a go at CMUArctic.