What is the proper way to train FastSpeech2 with MFA?
I have read the TTS training guide and followed this issue.
Now I want to summarize the process to help everyone. Please correct me if I'm wrong.
- Use `MFA validate` to get OOVs, run `MFA g2p` on the OOVs, and run `MFA align` with the expanded dictionary to obtain `.TextGrid` files.
- Generate the `data/train/text` file so that it uses the phoneme transcription instead of text, according to this. Add punctuation marks and silences to maintain alignments (see below).
- Generate the `data/train/durations` file from the `.TextGrid` files in this format. Convert TextGrid seconds into numbers of melspec frames. Add durations for punctuation and silences, then add a 0 to each line according to this. The total duration should match the total number of melspec frames in the audio (see the sketch after this list).
- In `run.sh`, set `--cleaner none` and `--g2p none`, as we're using MFA g2p (see issue here).
- Edit `local/data.sh`: comment out the lines with `rm ${text}` and `paste -d ... > ${text}`, which would otherwise overwrite your `data/train/text` file.
- Run stages 1-4 as usual.
- Run stage 5 with `./run.sh --stage 5 --stop_stage 5 --train_config conf/tuning/train_fastspeech2.yaml --teacher_dumpdir data --tts_stats_dir data/stats --write_collected_feats true` (reference).
- Train the rest normally.
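To make the `text`/`durations` generation concrete, here is a minimal sketch of the conversion described above. It is not the exact script from the linked issue: the `tgt` package, the `phones` tier name, and the `fs`/`hop_size` values are assumptions that must match your own setup.

```python
# Minimal sketch (not the exact script from the linked issue) for building the
# phoneme `text` line and the frame-level `durations` line from an MFA TextGrid.
# Assumptions: the `tgt` package (pip install tgt) for TextGrid parsing, a tier
# named "phones", and fs / hop_size matching the fbank config used by ESPnet.
import tgt

fs = 22050       # sampling rate (assumption: must match your corpus / feats config)
hop_size = 256   # hop length used for melspec extraction (assumption)

def to_frames(seconds: float) -> int:
    return int(round(seconds * fs / hop_size))

def textgrid_to_lines(utt_id: str, tg_path: str, n_mel_frames: int):
    tg = tgt.io.read_textgrid(tg_path)
    phones, durations = [], []
    for itv in tg.get_tier_by_name("phones").intervals:
        # MFA leaves silences/pauses unlabeled or specially labeled; map empty
        # labels to an explicit silence symbol so text and durations stay aligned.
        phones.append(itv.text if itv.text else "sil")
        durations.append(to_frames(itv.end_time) - to_frames(itv.start_time))
    # Absorb rounding drift so that sum(durations) == number of mel frames.
    durations[-1] += n_mel_frames - sum(durations)
    # Append 0 for the <eos> token that ESPnet adds internally.
    durations.append(0)
    text_line = f"{utt_id} {' '.join(phones)}"
    dur_line = f"{utt_id} {' '.join(map(str, durations))}"
    return text_line, dur_line
```

Writing `text_line` into `data/train/text` and `dur_line` into `data/train/durations` keeps the two files aligned per utterance; the trailing 0 is the `<eos>` duration mentioned in the comments below.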
For inference, first run normal decoding to extract the features:
```sh
./run_mfa.sh --stage 7 --stop-stage 7 \
    --tts_exp exp/default_fastspeech2 \
    --inference_config conf/tuning/decode_fastspeech.yaml \
    --inference_model valid.loss.ave_5best.pth \
    --test_sets eval1
```
Then go to https://github.com/kan-bayashi/ParallelWaveGAN and install it. Download your chosen vocoder `.pkl`, its `stats.h5` and `config.yml`, and run:
```sh
parallel-wavegan-decode \
    --checkpoint parallel_wavegan/checkpoint-3000000steps.pkl \
    --feats-scp exp/*/decode_fastspeech_valid.loss.ave_5best/eval1/norm/feats.scp \
    --outdir exp/*/parallel_wavegan
```
I don't know why, but running stage 7 directly with `--vocoder_file` makes the generated audio too soft and noisy.
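For reference, that direct route would look something like the following (a sketch only; the checkpoint path is a placeholder, and this is the variant reported above as giving soft, noisy output):

```sh
# Direct stage 7 decoding with a vocoder attached (reported above as noisy in this setup).
# Point --vocoder_file at your downloaded ParallelWaveGAN .pkl.
./run_mfa.sh --stage 7 --stop-stage 7 \
    --tts_exp exp/default_fastspeech2 \
    --inference_model valid.loss.ave_5best.pth \
    --vocoder_file /path/to/parallel_wavegan/checkpoint-3000000steps.pkl \
    --test_sets eval1
```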
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
OK
OK
Right.
Right. But you should add a 0 at the end of each line of the `durations` file, which is for the `<eos>` symbol (`<eos>` is added internally).
Right.
@iamanigeeit Sorry for the late reply.
Right. It assumes that `sum(duration) == #mel frames`.
Right. You need to restore them. And please add 0 for `<eos>` like:
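For illustration (the exact layout is an assumption based on the step list above, with a hypothetical utterance ID): a durations line such as `utt1 9 12 7 15 3 0` carries one frame count per phoneme plus a trailing `0` for the internally added `<eos>`.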