[Feature request]: Add formatter for MLS dataset
🚀 Feature Description
I’m sick of listening to the only French model we have now (Tacotron2-DDC+MelGAN), so I started training on M-AILABS-fr, but there is not really enough data to train from scratch…
Solution
Use Multilingual LibriSpeech (MLS), so that not only French gains more data to play with but also all of these languages:
- English (audio: 2.4T)
- German (audio: 115G)
- Dutch (audio: 86G)
- French (audio: 61G)
- Spanish (audio: 50G)
- Italian (audio: 15G)
- Portuguese (audio: 9.3G)
- Polish (audio: 6.2G)
https://arxiv.org/pdf/2012.03411.pdf
MLS is licensed under CC BY 4.0, is derived from read audiobooks from LibriVox, and is available in 8 languages. I find it quite similar to M-AILABS in terms of quality (not good, but passable).
It comes in two folders: mls_${lang}_opus (with audio in opus format) and mls_lm_${lang} (for the transcription; we don’t really need it for TTS).
mls_${lang}_opus has some metadata on the speaker:
SPEAKER | GENDER | PARTITION | MINUTES | BOOK ID | TITLE | CHAPTER
10065 | M | train | 9.002 | 10039 | Saint Évangile selon Saint Marc | Chapitre 09
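A minimal sketch of how a row of this metadata could be parsed, assuming the fields are always separated by "|" and that no title contains that character. The names SpeakerBookInfo and parse_metainfo_row are hypothetical and not part of any existing formatter.

```python
from dataclasses import dataclass


@dataclass
class SpeakerBookInfo:
    """One row of the pipe-separated speaker metadata (hypothetical container)."""
    speaker: str
    gender: str
    partition: str
    minutes: float
    book_id: str
    title: str
    chapter: str


def parse_metainfo_row(row: str) -> SpeakerBookInfo:
    # Assumption: exactly 7 fields separated by "|", as in the sample row.
    fields = [field.strip() for field in row.split("|")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {row!r}")
    speaker, gender, partition, minutes, book_id, title, chapter = fields
    return SpeakerBookInfo(
        speaker, gender, partition, float(minutes), book_id, title, chapter
    )


# Sample row taken from the issue text.
row = "10065 | M | train | 9.002 | 10039 | Saint Évangile selon Saint Marc | Chapitre 09"
info = parse_metainfo_row(row)
print(info.speaker, info.gender, info.partition, info.minutes)
```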
Segmented sentences…
1406_1028_000000 http://www.archive.org/download/les1001nuits_tome1_0711_librivox/1001nuits1_010_galland_64kb.mp3 210.62 227.11
… and their transcript
1406_1028_000000 pendant le second siècle je fis serment d'ouvrir tous les trésors de la terre à quiconque me mettrait en liberté mais je ne fus pas plus heureux dans le troisième je promis de faire puissant monarque mon libérateur d'être toujours près de lui en esprit
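The two line formats above could be read roughly like this. It is only a sketch: it assumes the fields are separated by whitespace (spaces or tabs), and the helper names parse_segment_line and parse_transcript_line are made up for illustration.

```python
def parse_segment_line(line: str):
    """Split a segment line into (segment_id, source_url, start_sec, end_sec)."""
    # Assumption: four whitespace-separated fields and no spaces in the URL.
    segment_id, url, start, end = line.split()
    return segment_id, url, float(start), float(end)


def parse_transcript_line(line: str):
    """Split a transcript line into (segment_id, text)."""
    # Only the first run of whitespace separates the id from the sentence.
    segment_id, text = line.strip().split(maxsplit=1)
    return segment_id, text


# Lines copied from the issue text (the transcript is shortened here).
segment_line = (
    "1406_1028_000000 "
    "http://www.archive.org/download/les1001nuits_tome1_0711_librivox/"
    "1001nuits1_010_galland_64kb.mp3 210.62 227.11"
)
transcript_line = "1406_1028_000000 pendant le second siècle je fis serment"

seg_id, url, start, end = parse_segment_line(segment_line)
text_id, text = parse_transcript_line(transcript_line)
duration = end - start  # length of the segment in seconds
print(seg_id, round(duration, 2), text)
```

The difference between the two timestamps gives the segment length, which is handy later for filtering out overly long sentences.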
Opus files are located in audio/*/*/*.opus (e.g. the opus file for sentence 1406_1028_000000 is located at audio/1406/1028/1406_1028_000000.opus).
❯ opusinfo audio/1406/1028/1406_1028_000000.opus
Processing file "audio/1406/1028/1406_1028_000000.opus"...
New logical stream (#1, serial: 46a945e0): type opus
Encoded with libopus 1.1.2
User comments section follows...
ENCODER=opusenc from opus-tools 0.1.10
ENCODER_OPTIONS=--quiet
artist=Antoine Galland
title=010 - 10eme nuit
album=Les mille et une nuits, tome 1
encoder=Lavf57.83.100
Opus stream 1:
Pre-skip: 312
Playback gain: 0 dB
Channels: 1
Original sample rate: 16000 Hz
Packet duration: 20.0ms (max), 20.0ms (avg), 20.0ms (min)
Page duration: 1000.0ms (max), 970.6ms (avg), 500.0ms (min)
Total data length: 71137 bytes (overhead: 2.99%)
Playback length: 0m:16.489s
Average bitrate: 34.51 kbit/s, w/o overhead: 33.48 kbit/s
Logical stream 1 ended
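The directory layout described above (audio/<first id part>/<second id part>/<id>.opus) can be derived from the sentence id alone. A small sketch, assuming every id has exactly three underscore-separated parts; opus_path and its root argument are hypothetical names:

```python
from pathlib import Path


def opus_path(root, segment_id: str) -> Path:
    """Build the expected opus file location for a sentence id."""
    # Assumption: ids always look like <part1>_<part2>_<index>.
    first, second, _index = segment_id.split("_")
    return Path(root) / "audio" / first / second / f"{segment_id}.opus"


# "data" is a placeholder for wherever the dataset folder was extracted.
path = opus_path("data", "1406_1028_000000")
print(path.as_posix())
```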
Alternative Solutions
We can always hire professional actors and actresses to say stuff in a recording studio, but that’s a lot more expensive.
Additional context
Overall it’s not a perfect solution but I’ll gladly take the added data.
It’s going to be tricky though.
- Because it’s multilingual (the formatter has to work for all of the languages)
- Because it’s in the opus format
- The sentences may be too long to train on, so we have to control the max length of each file.
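For the last point, one possible approach is to drop segments whose duration (end time minus start time from the segment list) exceeds a threshold. This is only a sketch: filter_by_duration is a hypothetical helper, and the 15-second limit is an arbitrary value chosen for illustration, not a recommendation from the project.

```python
def filter_by_duration(segments, max_seconds: float):
    """Keep the ids of segments no longer than max_seconds.

    segments: iterable of (segment_id, start_sec, end_sec) tuples.
    """
    return [
        segment_id
        for segment_id, start, end in segments
        if end - start <= max_seconds
    ]


segments = [
    # Real timestamps from the issue text: about 16.49 seconds long.
    ("1406_1028_000000", 210.62, 227.11),
    # Made-up entry for illustration only: 8 seconds long.
    ("hypothetical_short_segment", 0.0, 8.0),
]

kept = filter_by_duration(segments, max_seconds=15.0)
print(kept)
```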
Top GitHub Comments
There is already a formatter for MLS, you just have to convert the data to wav. https://github.com/coqui-ai/TTS/blob/c44e39d9d6bfeea15c6e600c6167663c0f9196ea/TTS/tts/datasets/formatters.py#L421
I’m closing this but you can continue discussing here 😃
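The wav conversion mentioned above could be scripted along these lines. This is a sketch under assumptions, not the project's own tooling: it assumes ffmpeg is installed and on the PATH, keeps the 16 kHz mono format reported by opusinfo, and uses the hypothetical helper names build_ffmpeg_cmd and convert_tree.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src, dst, sample_rate: int = 16000):
    """Return the ffmpeg argument list to decode one opus file to mono wav."""
    # -y overwrites existing output, -ar sets the sample rate, -ac 1 forces mono.
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-ar", str(sample_rate), "-ac", "1", str(dst),
    ]


def convert_tree(root, out_root, sample_rate: int = 16000):
    """Convert every audio/*/*/*.opus file under root, mirroring the layout."""
    root = Path(root)
    for src in sorted(root.glob("audio/*/*/*.opus")):
        dst = Path(out_root) / src.relative_to(root).with_suffix(".wav")
        dst.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(build_ffmpeg_cmd(src, dst, sample_rate), check=True)


# Show the command for one file without running ffmpeg.
cmd = build_ffmpeg_cmd("1406_1028_000000.opus", "1406_1028_000000.wav")
print(" ".join(cmd))
```

Calling convert_tree on the extracted dataset folder would write the wav files to a separate output folder, leaving the original opus files untouched.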
That makes a big difference!