[Feature request]: Add formatter for MLS dataset

🚀 Feature Description

I’m sick of listening to the only French model we have now (Tacotron2-DDC+MelGAN), so I started training on M-AILABS-fr, but there is not really enough data to train from scratch…

Solution

Use Multilingual LibriSpeech (MLS), so that not only French but all of these languages gain more data to play with:

English (audio: 2.4T)
German (audio: 115G)
Dutch (audio: 86G)
French (audio: 61G)
Spanish (audio: 50G)
Italian (audio: 15G)
Portuguese (audio: 9.3G)
Polish (audio: 6.2G)

https://arxiv.org/pdf/2012.03411.pdf

MLS is licensed under CC BY 4.0, is derived from LibriVox audiobook recordings, and is available in 8 languages. I find it quite similar to M-AILABS in terms of quality (not good, but passable).

It comes in two folders: mls_${lang}_opus (with audio in Opus format) and mls_lm_${lang} (for the transcription; we don’t really need it for TTS).

mls_${lang}_opus has some metadata on the speaker:

  SPEAKER   |   GENDER   | PARTITION  |  MINUTES   |  BOOK ID   |             TITLE              |            CHAPTER            
   10065    |     M      |   train    |   9.002    |   10039    | Saint Évangile selon Saint Marc |          Chapitre 09
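A row like the one above can be split into named fields with a few lines of Python. This is only a sketch: the column names are copied from the header row quoted here, the function name `parse_metainfo_row` is made up for illustration, and it assumes no title or chapter contains a `|` character.

```python
from typing import Dict

# Column names as they appear in the header row quoted above.
COLUMNS = ["SPEAKER", "GENDER", "PARTITION", "MINUTES", "BOOK ID", "TITLE", "CHAPTER"]


def parse_metainfo_row(line: str) -> Dict[str, str]:
    """Split one pipe-delimited metadata row into a dict keyed by column name.

    Hypothetical helper; assumes fields never contain a literal '|'.
    """
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


row = parse_metainfo_row(
    "   10065    |     M      |   train    |   9.002    |   10039    "
    "| Saint Évangile selon Saint Marc |          Chapitre 09"
)
print(row["SPEAKER"], row["GENDER"], row["MINUTES"])
```

Values are kept as strings; convert MINUTES to float only where needed.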

Segmented sentences…

1406_1028_000000	http://www.archive.org/download/les1001nuits_tome1_0711_librivox/1001nuits1_010_galland_64kb.mp3	210.62	227.11

… and their transcript

1406_1028_000000	pendant le second siècle je fis serment d'ouvrir tous les trésors de la terre à quiconque me mettrait en liberté mais je ne fus pas plus heureux dans le troisième je promis de faire puissant monarque mon libérateur d'être toujours près de lui en esprit
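Both kinds of line above are tab-separated and share the sentence ID as their first field, so they can be joined on it. The following is a minimal sketch, not the formatter that exists in the TTS repository; the names `Segment`, `parse_segment`, `parse_transcript` and `join_records` are hypothetical, and the sample transcript is shortened to the first words of the one quoted above.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Segment(NamedTuple):
    url: str      # source recording on archive.org
    start: float  # segment start within the recording, in seconds
    end: float    # segment end within the recording, in seconds


def parse_segment(line: str) -> Tuple[str, Segment]:
    """Parse 'id<TAB>url<TAB>start<TAB>end' into (id, Segment)."""
    sentence_id, url, start, end = line.rstrip("\n").split("\t")
    return sentence_id, Segment(url, float(start), float(end))


def parse_transcript(line: str) -> Tuple[str, str]:
    """Parse 'id<TAB>text' into (id, text)."""
    sentence_id, text = line.rstrip("\n").split("\t", 1)
    return sentence_id, text


def join_records(
    segment_lines: Iterable[str], transcript_lines: Iterable[str]
) -> Dict[str, Tuple[Segment, str]]:
    """Pair each transcript with its segment, keyed by sentence ID."""
    segments = dict(parse_segment(line) for line in segment_lines)
    records = {}
    for line in transcript_lines:
        sentence_id, text = parse_transcript(line)
        if sentence_id in segments:
            records[sentence_id] = (segments[sentence_id], text)
    return records


SEGMENT_LINE = (
    "1406_1028_000000\t"
    "http://www.archive.org/download/les1001nuits_tome1_0711_librivox/"
    "1001nuits1_010_galland_64kb.mp3\t210.62\t227.11"
)
# Shortened for the example: only the first words of the transcript.
TRANSCRIPT_LINE = (
    "1406_1028_000000\t"
    "pendant le second siècle je fis serment d'ouvrir tous les trésors de la terre"
)

records = join_records([SEGMENT_LINE], [TRANSCRIPT_LINE])
```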

Opus files are located in audio/*/*/*.opus (e.g. the audio for sentence 1406_1028_000000 is at audio/1406/1028/1406_1028_000000.opus)

❯ opusinfo audio/1406/1028/1406_1028_000000.opus
Processing file "audio/1406/1028/1406_1028_000000.opus"...

New logical stream (#1, serial: 46a945e0): type opus
Encoded with libopus 1.1.2
User comments section follows...
        ENCODER=opusenc from opus-tools 0.1.10
        ENCODER_OPTIONS=--quiet
        artist=Antoine Galland
        title=010 - 10eme nuit
        album=Les mille et une nuits, tome 1
        encoder=Lavf57.83.100
Opus stream 1:
        Pre-skip: 312
        Playback gain: 0 dB
        Channels: 1
        Original sample rate: 16000 Hz
        Packet duration:   20.0ms (max),   20.0ms (avg),   20.0ms (min)
        Page duration:   1000.0ms (max),  970.6ms (avg),  500.0ms (min)
        Total data length: 71137 bytes (overhead: 2.99%)
        Playback length: 0m:16.489s
        Average bitrate: 34.51 kbit/s, w/o overhead: 33.48 kbit/s
Logical stream 1 ended
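The directory layout used in this example (audio/1406/1028/1406_1028_000000.opus) can be derived from the sentence ID alone. A small sketch follows; `opus_path` is a hypothetical helper, and reading the first two ID fields as speaker and book ID is an assumption based on the metadata columns shown earlier.

```python
from pathlib import PurePosixPath


def opus_path(sentence_id: str, audio_root: str = "audio") -> PurePosixPath:
    """Build audio/<field 1>/<field 2>/<sentence_id>.opus from a sentence ID.

    Assumes IDs have exactly three underscore-separated fields, with the
    first two read as speaker and book ID.
    """
    speaker, book, _utterance = sentence_id.split("_")
    return PurePosixPath(audio_root) / speaker / book / f"{sentence_id}.opus"


print(opus_path("1406_1028_000000"))
```

An ID with a different number of fields raises a ValueError instead of silently producing a wrong path.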

Alternative Solutions

We can always hire professional actors and actresses to say stuff in a recording studio, but that’s a lot more expensive.

Additional context

Overall it’s not a perfect solution but I’ll gladly take the added data.

It’s going to be tricky though.

Because it’s multilingual (the formatter has to work for all of the languages)
Because the audio is in Opus format
Because the sentences may be too long to train on, so we have to cap the maximum length of each file.
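For the length concern, the start and end times in the segment lines give each clip's duration without opening any audio file. Here is a rough sketch of such a filter; the function name `filter_by_duration` and the 15-second cap are illustrative assumptions, not values taken from the TTS codebase.

```python
from typing import Dict, Tuple


def filter_by_duration(
    segments: Dict[str, Tuple[float, float]], max_seconds: float = 15.0
) -> Dict[str, float]:
    """Return {sentence_id: duration} for clips no longer than max_seconds.

    `segments` maps a sentence ID to its (start, end) times in seconds.
    The default cap is an arbitrary example value.
    """
    kept = {}
    for sentence_id, (start, end) in segments.items():
        duration = end - start
        if duration <= max_seconds:
            kept[sentence_id] = duration
    return kept


# The segment quoted earlier runs from 210.62 s to 227.11 s, about 16.49 s,
# which agrees with the 16.489 s playback length reported by opusinfo.
segments = {"1406_1028_000000": (210.62, 227.11)}

short_only = filter_by_duration(segments, max_seconds=15.0)   # clip is dropped
more_lenient = filter_by_duration(segments, max_seconds=20.0)  # clip is kept
```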

Top GitHub Comments

WeberJulian commented, Jun 16, 2022

There is already a formatter for MLS; you just have to convert the data to WAV. https://github.com/coqui-ai/TTS/blob/c44e39d9d6bfeea15c6e600c6167663c0f9196ea/TTS/tts/datasets/formatters.py#L421

I’m closing this but you can continue discussing here 😃
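One possible way to do the Opus-to-WAV conversion mentioned above is to call ffmpeg on every file while mirroring the directory tree. This is a sketch under assumptions: it requires an ffmpeg binary on PATH that can decode Opus, the function names are made up, and the 22050 Hz mono target is a guess (the mls_french_wav_22.05K folder name elsewhere in this thread suggests 22.05 kHz).

```python
import subprocess
from pathlib import Path
from typing import List


def ffmpeg_command(src: Path, dst: Path, sample_rate: int = 22050) -> List[str]:
    """Build an ffmpeg call that resamples src to a mono WAV at sample_rate.

    -y overwrites existing output, -ar sets the sample rate, -ac the channels.
    """
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-ar", str(sample_rate), "-ac", "1", str(dst),
    ]


def convert_tree(opus_root: Path, wav_root: Path, sample_rate: int = 22050) -> int:
    """Convert every .opus file under opus_root, mirroring the folder layout.

    Returns the number of files converted. Hypothetical helper, not part of
    the TTS repository.
    """
    count = 0
    for src in sorted(opus_root.rglob("*.opus")):
        dst = wav_root / src.relative_to(opus_root).with_suffix(".wav")
        dst.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(ffmpeg_command(src, dst, sample_rate), check=True)
        count += 1
    return count
```

Calling `convert_tree(Path("mls_french_opus"), Path("mls_french_wav"))` would process the whole dataset sequentially; the folder names here are placeholders.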

wasertech commented, Jun 17, 2022

| > Found 80109 files in /mnt/Données II/Données/TTS/data/extracted/M-AILABS/fr_FR_22.05K
| > Found 258213 files in /mnt/Données II/Données/TTS/data/extracted/MLS/mls_french_wav_22.05K

That makes a big difference!
