[Feature request]: Add formatter for MLS dataset

🚀 Feature Description

I’m sick of listening to the only French model we have now (Tacotron2-DDC+MelGAN), so I started training on M-AILABS-fr, but there is not really enough data to train from scratch…

Solution

Use Multilingual LibriSpeech (MLS), so that not only French gains more data to play with but also all of these languages:

  • English (audio: 2.4T)
  • German (audio: 115G)
  • Dutch (audio: 86G)
  • French (audio: 61G)
  • Spanish (audio: 50G)
  • Italian (audio: 15G)
  • Portuguese (audio: 9.3G)
  • Polish (audio: 6.2G)

https://arxiv.org/pdf/2012.03411.pdf

MLS is licensed under CC BY 4.0, is derived from read audiobooks from LibriVox, and is available in 8 languages. I find it quite similar to M-AILABS in terms of quality (not good but passable).

It comes in two folders: mls_${lang}_opus (with audio in Opus format) and mls_lm_${lang} (for the transcription; we don’t really need it for TTS).

mls_${lang}_opus has some metadata on the speaker:

  SPEAKER   |   GENDER   | PARTITION  |  MINUTES   |  BOOK ID   |             TITLE              |            CHAPTER            
   10065    |     M      |   train    |   9.002    |   10039    | Saint Évangile selon Saint Marc |          Chapitre 09 

Segmented sentences…

1406_1028_000000	http://www.archive.org/download/les1001nuits_tome1_0711_librivox/1001nuits1_010_galland_64kb.mp3	210.62	227.11

… and their transcript

1406_1028_000000	pendant le second siècle je fis serment d'ouvrir tous les trésors de la terre à quiconque me mettrait en liberté mais je ne fus pas plus heureux dans le troisième je promis de faire puissant monarque mon libérateur d'être toujours près de lui en esprit
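
The two sample lines above can be read with a few lines of Python. This is only a sketch: it assumes the fields are tab-separated (as they appear in the samples), and the names `Segment`, `parse_segment_line` and `parse_transcript_line` are made up for illustration, not part of any existing formatter.

```python
from typing import NamedTuple, Tuple


class Segment(NamedTuple):
    """One line of the segments file (hypothetical container)."""
    segment_id: str
    source_url: str
    start: float  # seconds into the original LibriVox recording
    end: float    # seconds into the original LibriVox recording

    @property
    def duration(self) -> float:
        return self.end - self.start


def parse_segment_line(line: str) -> Segment:
    # Assumption: exactly four tab-separated fields per line.
    segment_id, url, start, end = line.rstrip("\n").split("\t")
    return Segment(segment_id, url, float(start), float(end))


def parse_transcript_line(line: str) -> Tuple[str, str]:
    # Assumption: the ID and the text are separated by the first tab.
    segment_id, text = line.rstrip("\n").split("\t", 1)
    return segment_id, text
```

For the sample segment, the duration works out to 227.11 - 210.62 = 16.49 s, which matches the playback length opusinfo reports for the same file.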

Opus files are located in audio/*/*/*.opus (e.g. the Opus file for sentence 1406_1028_000000 is located at audio/1406/1028/1406_1028_000000.opus)
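
That layout means the path can be derived from the sentence ID alone. A minimal sketch, assuming every ID has exactly three underscore-separated parts and that the first two are the directory names (presumably speaker and book IDs; the function name `opus_path` is hypothetical):

```python
from pathlib import PurePosixPath


def opus_path(segment_id: str, audio_root: str = "audio") -> PurePosixPath:
    # Assumption: IDs look like "<dir1>_<dir2>_<utterance>", as in the example.
    first, second, _utterance = segment_id.split("_")
    return PurePosixPath(audio_root) / first / second / f"{segment_id}.opus"


print(opus_path("1406_1028_000000"))
```

`PurePosixPath` is used so the result looks the same on every platform; swap in `Path` if you want to open the file directly.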

❯ opusinfo audio/1406/1028/1406_1028_000000.opus
Processing file "audio/1406/1028/1406_1028_000000.opus"...

New logical stream (#1, serial: 46a945e0): type opus
Encoded with libopus 1.1.2
User comments section follows...
        ENCODER=opusenc from opus-tools 0.1.10
        ENCODER_OPTIONS=--quiet
        artist=Antoine Galland
        title=010 - 10eme nuit
        album=Les mille et une nuits, tome 1
        encoder=Lavf57.83.100
Opus stream 1:
        Pre-skip: 312
        Playback gain: 0 dB
        Channels: 1
        Original sample rate: 16000 Hz
        Packet duration:   20.0ms (max),   20.0ms (avg),   20.0ms (min)
        Page duration:   1000.0ms (max),  970.6ms (avg),  500.0ms (min)
        Total data length: 71137 bytes (overhead: 2.99%)
        Playback length: 0m:16.489s
        Average bitrate: 34.51 kbit/s, w/o overhead: 33.48 kbit/s
Logical stream 1 ended

Alternative Solutions

We can always hire professional actors and actresses to say stuff in a recording studio, but that’s a lot more expensive.

Additional context

Overall it’s not a perfect solution but I’ll gladly take the added data.

It’s going to be tricky though.

  1. Because it’s multilingual (the formatter has to work for all of the languages)
  2. Because it’s in the Opus format
  3. Because the sentences may be too long to train on, so we have to control the max length of each file.
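
For point 3, one cheap option is to drop long sentences before touching any audio, using the start and end times from the segments file. A rough sketch, assuming tab-separated lines of the form ID, URL, start, end; the function name and the 15-second default are placeholders, not values taken from any training recipe:

```python
from typing import Iterable, List


def keep_short_segments(lines: Iterable[str], max_seconds: float = 15.0) -> List[str]:
    """Return the IDs of segments whose duration is at most max_seconds."""
    kept = []
    for line in lines:
        # Assumption: four tab-separated fields (ID, URL, start, end).
        segment_id, _url, start, end = line.rstrip("\n").split("\t")
        if float(end) - float(start) <= max_seconds:
            kept.append(segment_id)
    return kept
```

With a 15-second cutoff, the sample segment shown earlier (about 16.5 s) would be filtered out, so the threshold needs tuning against how much data you are willing to lose.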

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

5 reactions
WeberJulian commented, Jun 16, 2022

There is already a formatter for MLS; you just have to convert the data to WAV. https://github.com/coqui-ai/TTS/blob/c44e39d9d6bfeea15c6e600c6167663c0f9196ea/TTS/tts/datasets/formatters.py#L421

I’m closing this but you can continue discussing here 😃
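
One possible way to do the WAV conversion is to shell out to ffmpeg for every Opus file. This is a sketch under assumptions: ffmpeg must be installed and on the PATH, the helper names are hypothetical, the 22050 Hz default is just a placeholder for whatever your training config expects, and mirroring the directory tree is a guess at what is convenient rather than something the linked formatter is known to require.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(src: Path, dst: Path, sample_rate: int = 22050) -> List[str]:
    # -y overwrites existing output, -ar sets the sample rate, -ac 1 forces mono.
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", str(sample_rate), "-ac", "1", str(dst)]


def convert_tree(opus_root: Path, wav_root: Path, sample_rate: int = 22050) -> None:
    """Convert every .opus file under opus_root, mirroring the directory layout."""
    for src in sorted(opus_root.rglob("*.opus")):
        dst = wav_root / src.relative_to(opus_root).with_suffix(".wav")
        dst.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
```

Calling `convert_tree(Path("mls_french_opus"), Path("mls_french_wav"))` would then walk the whole dataset; both directory names here are examples only.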

0 reactions
wasertech commented, Jun 17, 2022
| > Found 80109 files in /mnt/Données II/Données/TTS/data/extracted/M-AILABS/fr_FR_22.05K
| > Found 258213 files in /mnt/Données II/Données/TTS/data/extracted/MLS/mls_french_wav_22.05K

That makes a big difference!
