LibriTTS out-of-box produces incoherent speech
Using out-of-the-box training produces results that do not form coherent words. I initially ran prepare_libri.ipynb with 20 speakers, then followed the MFA instructions and hit size mismatches, which, according to earlier issues, running Tacotron2's extract_duration.py should resolve, and it did.
So I ran:
```bash
bash ttsexamples/mfa_extraction/scripts/prepare_mfa.sh

python ttsexamples/mfa_extraction/run_mfa.py \
  --corpus_directory ./libritts \
  --output_directory ./mfa/parsed \
  --jobs 8

python ttsexamples/mfa_extraction/txt_grid_parser.py \
  --yaml_path ttsexamples/fastspeech2_libritts/conf/fastspeech2libritts.yaml \
  --dataset_path ./libritts \
  --text_grid_path ./mfa/parsed \
  --output_durations_path ./libritts/durations \
  --sample_rate 24000

tensorflow-tts-preprocess --rootdir ./libritts \
  --outdir ./dump_libritts \
  --config preprocess/libritts_preprocess.yaml \
  --dataset libritts

tensorflow-tts-normalize --rootdir ./dump_libritts \
  --outdir ./dump_libritts \
  --config preprocess/libritts_preprocess.yaml \
  --dataset libritts
```
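At this point a quick sanity check is worth running, since this is where the size mismatches first appear. This is a hedged sketch of my own (not part of the repo), assuming the usual dump layout of `durations/*-durations.npy` and `norm-feats/*-norm-feats.npy`; adjust the paths to match your dump:

```python
# For each utterance, the summed per-phone durations should equal the number
# of mel frames; anything printed below is a mismatch that extract_duration.py
# later has to fix.
import numpy as np
from pathlib import Path

dump = Path("./dump_libritts/train")
for dur_path in sorted((dump / "durations").glob("*-durations.npy")):
    utt = dur_path.name[: -len("-durations.npy")]
    mel_path = dump / "norm-feats" / f"{utt}-norm-feats.npy"
    if not mel_path.exists():
        continue
    n_frames = np.load(mel_path).shape[0]
    dur_sum = int(np.load(dur_path).sum())
    if dur_sum != n_frames:
        print(f"{utt}: durations sum to {dur_sum}, mel has {n_frames} frames")
```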
(The MFA run above is needed because it generates the train.txt required later.)
Then I extracted durations (for both train and valid):
```bash
CUDA_VISIBLE_DEVICES=0 python ttsexamples/tacotron2/extract_duration.py \
  --rootdir ./dump_libritts/train/ \
  --outdir ./dump_libritts/train/durations/ \
  --checkpoint ./ttsexamples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-120000.h5 \
  --use-norm 1 \
  --config ./ttsexamples/tacotron2/conf/tacotron2.v1.yaml \
  --batch-size 32 \
  --win-front 3 \
  --win-back 3
```
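A hedged spot check on the freshly extracted durations can also save a wasted training run: if the Tacotron2 teacher was trained on different data, its attention is often degenerate and the extracted durations come out as near-zeros or a single huge spike. The file naming below is an assumption; match it to your --outdir:

```python
# Print basic stats for a few extracted duration files; durations that are
# almost all zeros, or dominated by one entry, are a red flag.
import numpy as np
from pathlib import Path

for p in sorted(Path("./dump_libritts/train/durations").glob("*.npy"))[:5]:
    d = np.load(p)
    print(p.name, "len:", len(d), "sum:", int(d.sum()), "max:", int(d.max()))
```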
and finally I ran the training script:
```bash
bash ttsexamples/fastspeech2_libritts/scripts/train_libri.sh
```
This ultimately does not generate proper speech when tested with the pretrained LibriTTS vocoder (nor with other vocoders):
```python
from tensorflow_tts.inference import AutoConfig, TFAutoModel

config = AutoConfig.from_pretrained("../pretrained/mbvocs24k/multiband_melgan.v1_24k.yaml")
mb_melgan = TFAutoModel.from_pretrained(
    config=config,
    pretrained_path="../pretrained/mbvocs24k/libritts_24k.h5",  # "../examples/fastspeech2/checkpoints/model-150000.h5"
    name="melgan",
)
```
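For what it's worth, a copy-synthesis check can narrow this down. This is a hedged sketch assuming the dump paths from the commands above and the vocoder's usual `inference()` method; replace `<utt-id>` with a real utterance id. If vocoding a ground-truth mel already sounds wrong, the problem is a preprocessing/vocoder mismatch (hop size, sample rate, normalization stats) rather than FastSpeech2:

```python
# Copy-synthesis: feed a ground-truth (normalized) mel from the dump straight
# into the pretrained vocoder and listen to the result.
import numpy as np
import soundfile as sf
import tensorflow as tf

mel = np.load("./dump_libritts/valid/norm-feats/<utt-id>-norm-feats.npy")
audio = mb_melgan.inference(tf.expand_dims(mel, 0))[0, :, 0]
sf.write("copy_synthesis.wav", audio.numpy(), 24000)
```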
Note: per previous issues, I changed the hop size to 300 in the yaml configurations (at 24 kHz, 300 samples = 12.5 ms per frame), so the preprocessing and vocoder configs agree.
I would appreciate any hint on what is going wrong. I would love to upload and contribute the trained model at the end.
@shachar-ug What Tacotron2 model are you using for extracting durations? It has to have been trained on the exact same dataset you are now trying to train on.
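Seconding this: one hedged way to check for a symbol-set mismatch (the HDF5 layout varies, so inspect your own checkpoint rather than trusting any fixed key path) is to list the weight tensors in the Tacotron2 .h5 and compare the character-embedding row count with the symbol count of the processor your preprocessing run produced:

```python
# List every weight tensor in the checkpoint with its shape; look for the
# character-embedding matrix, whose first dimension must equal the number of
# symbols in the processor used for the current dataset.
import h5py

ckpt = "./ttsexamples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-120000.h5"
with h5py.File(ckpt, "r") as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))
```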