
What is the proper way to train FastSpeech2 with MFA?


I have read the TTS training guide and followed this issue.

Now I want to summarize the process to help everyone. Please correct me if I'm wrong.

  • Use MFA validate to find out-of-vocabulary words (OOVs), run MFA g2p on the OOVs, and run MFA align with the expanded dictionary to obtain .TextGrid files
  • Generate the data/train/text file with the phoneme transcription instead of raw text, according to this. Add punctuation marks and silences to maintain the alignments (see below).
  • Generate the data/train/durations file from the .TextGrid files in this format. Convert TextGrid seconds into mel-spectrogram frames. Add durations for punctuation and silences, then append a 0 to each line according to this. The durations on each line should sum to the number of mel-spectrogram frames in the audio (see the sketch after this list).
  • In run.sh, set --cleaner none and --g2p none, since we're using MFA's g2p – see the issue here
  • Edit local/data.sh: comment out the lines with rm ${text} and paste -d ... > ${text}, which would otherwise overwrite your data/train/text file
  • Run stages 1–4 as usual
  • Run stage 5 with ./run.sh --stage 5 --stop_stage 5 --train_config conf/tuning/train_fastspeech2.yaml --teacher_dumpdir data --tts_stats_dir data/stats --write_collected_feats true (reference)
  • Train the rest normally
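
For concreteness, here is a minimal Python sketch of the text/durations generation above. It assumes the TextGrid phone tier has already been parsed into (label, start, end) tuples (for example with the textgrid or tgt packages); the function name is made up for illustration, and fs/n_shift are placeholders that must match your feats_extract settings.

# Sketch: build one line of data/train/text and data/train/durations from an
# MFA phone tier. The helper is hypothetical, not part of ESPnet.
def textgrid_to_text_and_durations(intervals, fs=22050, n_shift=256):
    """intervals: list of (label, start_sec, end_sec) from the TextGrid phone tier."""
    tokens, durations = [], []
    prev_frame = 0
    for label, start, end in intervals:
        # MFA marks silences as empty intervals; keep them (e.g. as "sil")
        # so the alignment between tokens and durations is preserved.
        tokens.append(label if label else "sil")
        # Convert seconds to mel-spectrogram frames; rounding the cumulative
        # boundary keeps the per-line total consistent.
        end_frame = int(round(end * fs / n_shift))
        durations.append(end_frame - prev_frame)
        prev_frame = end_frame
    durations.append(0)  # trailing 0 for the <eos> symbol added internally
    return tokens, durations

# e.g. write the two lines for one utterance
intervals = [("j", 0.00, 0.10), ("", 0.10, 0.18), ("ɛ", 0.18, 0.30)]
tokens, durations = textgrid_to_text_and_durations(intervals)
print("sample1", " ".join(tokens))                # -> line for data/train/text
print("sample1", " ".join(map(str, durations)))   # -> line for data/train/durations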

For inference, first run normal decoding (stage 7) to generate the features:

./run_mfa.sh --stage 7 --stop-stage 7 \
    --tts_exp exp/default_fastspeech2 \
    --inference_config conf/tuning/decode_fastspeech.yaml \
    --inference_model valid.loss.ave_5best.pth \
    --test_sets eval1

Then go to https://github.com/kan-bayashi/ParallelWaveGAN and install it. Download your chosen vocoder checkpoint (.pkl), stats.h5, and config.yml, and run

parallel-wavegan-decode \
    --checkpoint parallel_wavegan/checkpoint-3000000steps.pkl \
    --feats-scp exp/*/decode_fastspeech_valid.loss.ave_5best/eval1/norm/feats.scp \
    --outdir exp/*/parallel_wavegan

I don’t know why, but running stage 7 directly with --vocoder_file makes the generated audio too soft and noisy.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

3 reactions
kan-bayashi commented, Jul 21, 2022

> Use MFA to obtain the .TextGrid files (done)

OK

> Train stages 1–3 as usual (done)

OK

> Create tokens.txt with the MFA phonemes. But since the default behaviour is the one described in https://github.com/espnet/espnet/issues/3340, we need to set cleaner=none and g2p=none during training and edit every data/*/text file to use the phoneme transcription instead of raw text, according to https://github.com/espnet/espnet/issues/3870. Correct?

Right.

> Create the durations file in the format from https://github.com/espnet/espnet/issues/2632#issuecomment-734211026, add 0 to each line according to https://github.com/espnet/espnet/issues/4113, and place it as described in https://github.com/espnet/espnet/issues/2712#issuecomment-732671515. The durations are mel-spectrogram frames. Since MFA TextGrids use seconds, I have to convert them based on fs, n_fft and n_shift. Correct?

Right. But you should add the 0 at the end of each line of the durations file; it is for the <eos> symbol (<eos> is added internally).

> Run stage 5 according to https://github.com/espnet/espnet/issues/2712#issuecomment-732671515, then train the rest normally.

Right.

1 reaction
kan-bayashi commented, Jul 26, 2022

@iamanigeeit Sorry for the late reply.

> For each line in the durations file, adding all the numbers should give me the number of mel-spectrogram frames, right?

Right. It assumes that sum(duration) == #mel frames.

> So I need to add them back into data/*/text and insert their durations into the durations file?

Right. You need to restore them. And please add 0 for <eos> like:

data/*/text:  sample1 j ɛ s , b ɐ t sil
durations:    sample1 3 4 3 7 3 5 3 2 0
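
As a sanity check on the restored files, a sketch like the one below could verify both assumptions at once: one duration per token plus a trailing 0 for <eos>, and sum(duration) == #mel frames. Here n_frames_by_utt is a hypothetical mapping from utterance id to mel-frame count, obtained from however you dumped the features.

# Sketch: validate data/*/text against data/*/durations (hypothetical helper).
def check_alignment_files(text_path, durations_path, n_frames_by_utt):
    with open(text_path, encoding="utf-8") as f:
        tokens = {line.split()[0]: line.split()[1:] for line in f}
    with open(durations_path, encoding="utf-8") as f:
        for line in f:
            utt_id, *durs = line.split()
            durs = [int(d) for d in durs]
            assert durs[-1] == 0, f"{utt_id}: last duration must be 0 for <eos>"
            assert len(durs) == len(tokens[utt_id]) + 1, f"{utt_id}: one duration per token plus <eos>"
            assert sum(durs) == n_frames_by_utt[utt_id], f"{utt_id}: durations must sum to #mel frames"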