Improve tests for Dataset

See original GitHub issue

torchaudio had only minimal tests for its dataset implementations. See here

Recently we improved our test utilities, and we can now generate synthetic data that emulates a subset of a dataset. See the examples for YesNo and GTZAN.

We would like to do the same for the remaining datasets:

  • VCTK
  • LibriSpeech
  • LJSpeech
  • SpeechCommands
  • CMUArctic
  • CommonVoice

General Direction

  1. Check the dataset of interest and pick a subset of files (check their naming conventions, sampling rate, and number of channels).

  2. Following the approach of the existing test modules, create a new test module test/datasets/XXX_test.py and define your test class.

  3. Generate a pseudo dataset in the setUpClass method and create a list of the expected data.

  4. Traverse the directory with the Dataset implementation.

  5. Check that the files are traversed in the expected order and that the loaded data match.

  6. Check that the Dataset traversed the expected number of files.

  7. If the dataset has multiple operational modes, like subset in GTZAN, add these as test methods as well.

  8. Once the new test is added, remove the original test and the associated assets.

    test/assets/ARCTIC/cmu_us_aew_arctic/etc/txt.done.data
    test/assets/ARCTIC/cmu_us_aew_arctic/wav/arctic_a0024.wav
    test/assets/CommonVoice/cv-corpus-4-2019-12-10/tt/clips/common_voice_tt_00000000.wav
    test/assets/CommonVoice/cv-corpus-4-2019-12-10/tt/train.tsv
    test/assets/LJSpeech-1.1/metadata.csv
    test/assets/LJSpeech-1.1/wavs/LJ001-0001.wav
    test/assets/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac
    test/assets/LibriSpeech/dev-clean/1272/128104/1272-128104.trans.txt
    test/assets/SpeechCommands/speech_commands_v0.02/go/0a9f9af7_nohash_0.wav
    test/assets/VCTK-Corpus/txt/p224/p224_002.txt
    test/assets/VCTK-Corpus/wav48/p224/p224_002.wav
    
  9. Once the PR is ready, add @mthrok as a reviewer.
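
Steps 3 to 6 can be sketched roughly as follows. This is a minimal illustration only, not torchaudio code: ToyDataset, save_wav_int16, and load_wav_normalized are hypothetical stand-ins written with the Python standard library, and the directory layout (<root>/<label>/<n>.wav) is an assumption. A real test would use the torchaudio Dataset class and the project's own test utilities instead.

```python
import os
import random
import struct
import tempfile
import unittest
import wave


def save_wav_int16(path, samples, sample_rate):
    """Write mono int16 samples to a WAV file (hypothetical stand-in for save_wav)."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = int16
        f.setframerate(sample_rate)
        f.writeframes(struct.pack(f"<{len(samples)}h", *samples))


def load_wav_normalized(path):
    """Read a mono int16 WAV file and return (floats in [-1.0, 1.0), sample_rate)."""
    with wave.open(path, "rb") as f:
        n = f.getnframes()
        raw = struct.unpack(f"<{n}h", f.readframes(n))
        return [s / 32768.0 for s in raw], f.getframerate()


class ToyDataset:
    """Hypothetical dataset that walks <root>/<label>/*.wav in sorted order."""

    def __init__(self, root):
        self._items = []
        for label in sorted(os.listdir(root)):
            for name in sorted(os.listdir(os.path.join(root, label))):
                self._items.append((os.path.join(root, label, name), label))

    def __len__(self):
        return len(self._items)

    def __getitem__(self, n):
        path, label = self._items[n]
        waveform, sample_rate = load_wav_normalized(path)
        return waveform, sample_rate, label


class ToyDatasetTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Step 3: generate a pseudo dataset and record the expected data.
        cls.tmp = tempfile.TemporaryDirectory()
        cls.root = cls.tmp.name
        cls.expected = []
        rng = random.Random(0)  # fixed seed keeps the synthetic data reproducible
        sample_rate = 16000
        for label in ["go", "stop"]:
            os.makedirs(os.path.join(cls.root, label))
            for i in range(2):
                samples = [rng.randint(-32768, 32767) for _ in range(160)]
                path = os.path.join(cls.root, label, f"{i}.wav")
                save_wav_int16(path, samples, sample_rate)
                normalized = [s / 32768.0 for s in samples]
                cls.expected.append((normalized, sample_rate, label))

    @classmethod
    def tearDownClass(cls):
        cls.tmp.cleanup()

    def test_traversal(self):
        # Step 4: traverse the directory with the dataset implementation.
        dataset = ToyDataset(self.root)
        num_samples = 0
        for i in range(len(dataset)):
            # Step 5: files come back in the expected order with matching data.
            self.assertEqual(dataset[i], self.expected[i])
            num_samples += 1
        # Step 6: the dataset traversed the expected number of files.
        self.assertEqual(num_samples, len(self.expected))
```

Saved as a test module, a class like this would be picked up by pytest in the same way as the existing dataset tests.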

Note

  • Using Anaconda is highly recommended.
  • Please use the nightly build of PyTorch. https://pytorch.org/
  • You can run the test with pytest test/datasets/XXX_test.py.
  • PR example: #819
  • For simplicity, please use the wav format when saving synthetic data (save_wav), even if the reference dataset uses another format. (Decoding formats like mp3 adds complexity to the test logic, which we are trying to avoid.)
  • When saving wave data with save_wav, the dtype of the Tensor makes a difference. If the reference dataset uses the WAV format, use the same bit depth (like int16). If the reference dataset uses a compressed format, like mp3 or flac, use float32 wav.
  • Data loaded with a Dataset implementation is typically normalized (values in [-1.0, 1.0]) and of float32 type, which is why normalize_wav is used to generate reference data in the examples above.
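
To illustrate why the dtype matters when building reference data, here is a small sketch of integer-to-float normalization. normalize_samples is a hypothetical helper, not torchaudio's normalize_wav, and the scaling convention (dividing by 2^(bits-1)) is an assumption that may differ from what the real utility does.

```python
def normalize_samples(samples, dtype):
    """Scale raw PCM samples to floats in [-1.0, 1.0] (hypothetical helper)."""
    if dtype == "float32":
        # Float data is assumed to be normalized already.
        return [float(s) for s in samples]
    # Assumed convention: divide by the magnitude of the most negative value.
    scale = {"int16": 32768.0, "int32": 2147483648.0}
    if dtype not in scale:
        raise ValueError(f"unsupported dtype: {dtype}")
    return [s / scale[dtype] for s in samples]
```

The same integer sample maps to a different float depending on the assumed bit depth, so the expected data has to be generated with the dtype that was used when saving the file.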

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Comments:12 (12 by maintainers)

Top GitHub Comments

YazhiGao commented, Jul 23, 2020

I’ll take a look at CommonVoice dataset.

suraj813 commented, Jul 23, 2020

I’ll have a go at CMUArctic
