LibriTTS out-of-box produces incoherent speech

Out-of-the-box training produces output that does not form coherent words. I first ran prepare_libri.ipynb with 20 speakers and then followed the MFA instructions, but I ran into size mismatches. I read that running Tacotron's extract_duration.py should resolve them, and it did.

So I ran:

    bash ttsexamples/mfa_extraction/scripts/prepare_mfa.sh

    python ttsexamples/mfa_extraction/run_mfa.py \
      --corpus_directory ./libritts \
      --output_directory ./mfa/parsed \
      --jobs 8

    python ttsexamples/mfa_extraction/txt_grid_parser.py \
      --yaml_path ttsexamples/fastspeech2_libritts/conf/fastspeech2libritts.yaml \
      --dataset_path ./libritts \
      --text_grid_path ./mfa/parsed \
      --output_durations_path ./libritts/durations \
      --sample_rate 24000

    tensorflow-tts-preprocess --rootdir ./libritts \
      --outdir ./dump_libritts \
      --config preprocess/libritts_preprocess.yaml \
      --dataset libritts

    tensorflow-tts-normalize --rootdir ./dump_libritts \
      --outdir ./dump_libritts \
      --config preprocess/libritts_preprocess.yaml \
      --dataset libritts

(I ran the MFA steps because they generate the train.txt that is required later.)

Then I extracted durations (for both train and valid):

    CUDA_VISIBLE_DEVICES=0 python ttsexamples/tacotron2/extract_duration.py \
      --rootdir ./dump_libritts/train/ \
      --outdir ./dump_libritts/train/durations/ \
      --checkpoint ./ttsexamples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-120000.h5 \
      --use-norm 1 \
      --config ./ttsexamples/tacotron2/conf/tacotron2.v1.yaml \
      --batch-size 32 \
      --win-front 3 \
      --win-back 3
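
A quick way to confirm that the size mismatches are really gone is to check, per utterance, that the extracted durations sum to the number of mel frames. The following is a minimal sketch of that check, not code from the repository. The function name `duration_mismatch` is hypothetical, and the arrays are synthetic stand-ins for the duration and mel feature arrays that would be loaded from the dump directory.

```python
import numpy as np


def duration_mismatch(durations, mel):
    """Return sum(durations) minus the number of mel frames.

    A result of 0 means the per-token durations line up with the mel
    spectrogram. `mel` is assumed to have shape (frames, mel_bins).
    """
    durations = np.asarray(durations)
    mel = np.asarray(mel)
    return int(durations.sum()) - int(mel.shape[0])


# Synthetic example (assumption): 4 tokens and a 10-frame, 80-bin mel.
durations = np.array([2, 3, 4, 1])
mel = np.zeros((10, 80), dtype=np.float32)
print(duration_mismatch(durations, mel))  # 0
```

A non-zero value for any utterance would suggest that the durations and the features were produced with different settings or from different data.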

Finally, I ran:

    bash ttsexamples/fastspeech2_libritts/scripts/train_libri.sh

This ultimately does not generate proper speech, whether I test with the LibriTTS pretrained vocoder or with other vocoders:

    from tensorflow_tts.inference import AutoConfig, TFAutoModel

    # Load the 24 kHz multi-band MelGAN vocoder configuration and weights.
    config = AutoConfig.from_pretrained("../pretrained/mbvocs24k/multiband_melgan.v1_24k.yaml")
    mb_melgan = TFAutoModel.from_pretrained(
        config=config,
        pretrained_path='../pretrained/mbvocs24k/libritts_24k.h5', # "../examples/fastspeech2/checkpoints/model-150000.h5",
        name="melgan"
    )

Note: I changed the hop size to 300 in the YAML configurations, as suggested in previous issues.
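
For reference, here is what a hop size of 300 implies at the 24000 Hz sample rate passed to txt_grid_parser.py above. This is a sketch of the arithmetic only. The helper `seconds_to_frames` is a hypothetical name and is not the parser's actual implementation.

```python
SAMPLE_RATE = 24000  # Hz, from --sample_rate above
HOP_SIZE = 300       # samples per frame, from the note above


def seconds_to_frames(seconds, sample_rate=SAMPLE_RATE, hop_size=HOP_SIZE):
    """Convert an aligned interval length in seconds to a frame count."""
    return int(round(seconds * sample_rate / hop_size))


frames_per_second = SAMPLE_RATE / HOP_SIZE      # 80.0 frames per second
frame_shift_ms = 1000 * HOP_SIZE / SAMPLE_RATE  # 12.5 ms per frame

print(frames_per_second, frame_shift_ms)
print(seconds_to_frames(0.25))  # a 0.25 s interval spans 20 frames
```

Because each interval is rounded on its own, the per-interval frame counts may not add up exactly to the mel length, which is one plausible source of size mismatches. Every stage (alignment, preprocessing, acoustic model, vocoder) needs to agree on the same sample rate and hop size.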

I would appreciate any hint about what is going wrong. I would love to upload and contribute a generated model at the end.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
ZDisket commented, Sep 14, 2021

@shachar-ug What Tacotron2 model are you using for extracting durations? It has to match the exact same dataset you’re trying to train it on.

0 reactions
stale[bot] commented, Nov 15, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
