Improve tests for Dataset
See original GitHub issue
torchaudio has had only minimalistic tests for its dataset implementations (see here). We have recently improved our test utilities, and we can now generate synthetic data that emulates a subset of a dataset; see the YesNo and GTZAN examples.
We would like to do the same for the remaining datasets:
- VCTK
- LibriSpeech
- LJSpeech
- SpeechCommands
- CMUArctic
- CommonVoice
General Direction
- Check the dataset of interest and pick a subset of its files (check their naming conventions, sampling rate, and number of channels).
- Following the approach of the existing test modules, create a new test module `test/datasets/XXX_test.py` and define your test class.
- Generate the pseudo dataset in the `setUpClass` method, and create a list of expected data.
- Traverse the directory with the Dataset implementation.
- Check that the files are traversed in the expected order and that the loaded data match.
- Check that the Dataset traversed the expected number of files.
- If the dataset has multiple operational modes, like `subset` in GTZAN, also add these as test methods.
- Once the new test is added, remove the original test and the associated assets:
  - `test/assets/ARCTIC/cmu_us_aew_arctic/etc/txt.done.data`
  - `test/assets/ARCTIC/cmu_us_aew_arctic/wav/arctic_a0024.wav`
  - `test/assets/CommonVoice/cv-corpus-4-2019-12-10/tt/clips/common_voice_tt_00000000.wav`
  - `test/assets/CommonVoice/cv-corpus-4-2019-12-10/tt/train.tsv`
  - `test/assets/LJSpeech-1.1/metadata.csv`
  - `test/assets/LJSpeech-1.1/wavs/LJ001-0001.wav`
  - `test/assets/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac`
  - `test/assets/LibriSpeech/dev-clean/1272/128104/1272-128104.trans.txt`
  - `test/assets/SpeechCommands/speech_commands_v0.02/go/0a9f9af7_nohash_0.wav`
  - `test/assets/VCTK-Corpus/txt/p224/p224_002.txt`
  - `test/assets/VCTK-Corpus/wav48/p224/p224_002.wav`
- Once the PR is ready, add @mthrok as reviewer.
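The steps above can be sketched as a self-contained test module. This is an illustrative outline only: `_save_wav` is a hypothetical stand-in for torchaudio's shared test helper, and the sorted directory walk stands in for iterating over the actual `torchaudio.datasets` class.

```python
import os
import struct
import tempfile
import unittest
import wave


def _save_wav(path, samples, sample_rate):
    # Hypothetical stand-in for the shared save_wav test helper:
    # writes 16-bit mono PCM integer samples to `path`.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(samples), *samples))


class TestFakeDataset(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Generate a pseudo dataset mimicking the real layout, and
        # record the expected traversal order.
        cls._tmp = tempfile.TemporaryDirectory()
        cls.root_dir = cls._tmp.name
        cls.expected = []
        for i in range(3):
            rel_path = os.path.join("speaker0", "utterance_%04d.wav" % i)
            _save_wav(
                os.path.join(cls.root_dir, rel_path), [0, i, -i], sample_rate=16000)
            cls.expected.append(rel_path)

    @classmethod
    def tearDownClass(cls):
        cls._tmp.cleanup()

    def test_traversal(self):
        # A real test would iterate over the Dataset instance; here a
        # sorted os.walk stands in for that traversal.
        found = []
        for dirpath, _, filenames in sorted(os.walk(self.root_dir)):
            for name in sorted(filenames):
                found.append(
                    os.path.relpath(os.path.join(dirpath, name), self.root_dir))
        self.assertEqual(found, self.expected)            # order matches
        self.assertEqual(len(found), len(self.expected))  # count matches
```

A real module would replace the manual walk with the dataset class under test and also compare each loaded waveform and label against the expected data.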
Note
- It is highly recommended to use Anaconda.
- Please use a nightly build of PyTorch: https://pytorch.org/
- You can run the test with `pytest test/datasets/XXX_test.py`.
- PR example: #819
- For simplicity, please use the `wav` format when saving synthetic data (`save_wav`), even if the reference dataset uses another format. (Decoding formats like `mp3` adds complexity to the test logic, which we are trying to avoid.)
- When saving wave data with `save_wav`, the `dtype` of the Tensor makes a difference. If the reference dataset uses the WAV format, use the same bit depth (like `int16`). If the reference dataset uses a compressed format, like `mp3` or `flac`, use `float32` wav.
- Data loaded with a Dataset implementation typically has normalized values (in `[-1.0, 1.0]`) and `float32` type. (This is why `normalize_wav` is used to generate the reference data in the examples above.)
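The dtype note can be illustrated with a minimal sketch. The helpers below are hypothetical stand-ins for `save_wav` / `normalize_wav` (the real helpers live in torchaudio's test utilities and operate on Tensors):

```python
import os
import struct
import tempfile
import wave


def save_wav_int16(path, samples, sample_rate):
    # Hypothetical int16 writer: matches reference datasets that are
    # distributed as 16-bit WAV files.
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(samples), *samples))


def normalize_wav(samples):
    # Map int16 sample values into the normalized [-1.0, 1.0] range that
    # Dataset implementations return after loading.
    return [s / 32768.0 for s in samples]


raw = [0, 16384, -32768]  # int16 sample values
with tempfile.TemporaryDirectory() as tmp:
    save_wav_int16(os.path.join(tmp, "ref.wav"), raw, sample_rate=8000)
expected = normalize_wav(raw)  # [0.0, 0.5, -1.0]
```

Matching the bit depth of the reference data keeps the generated file byte-compatible with what the Dataset loader expects, while `normalize_wav` produces the reference values to compare against.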
Issue Analytics
- State:
- Created 3 years ago
- Comments: 12 (12 by maintainers)
Top GitHub Comments
I’ll take a look at CommonVoice dataset.
I’ll have a go at CMUArctic.