
What is the proper way to train FastSpeech2 with MFA?


I have read the TTS training guide and followed this issue.

Now I want to summarize the process to help everyone. Please correct me if I'm wrong.

  • Use MFA validate to find out-of-vocabulary words (OOVs), run MFA g2p on the OOVs, and run MFA align with the expanded dictionary to obtain .TextGrid files
  • Generate the data/train/text file with the phoneme transcription instead of raw text, according to this. Add punctuation marks and silences to maintain the alignments (see below).
  • Generate the data/train/durations file from the .TextGrid files in this format. Convert TextGrid seconds into mel-spectrogram frames. Add durations for punctuation and silences, then append a 0 to each line according to this. The durations on each line should sum to the number of mel-spectrogram frames in the audio (see the sketch after this list).
  • In run.sh, set --cleaner none and --g2p none, since we're using MFA's g2p – see the issue here
  • Edit local/data.sh: comment out the lines with rm ${text} and paste -d ... > ${text}, which would otherwise overwrite your data/train/text file
  • Run stages 1–4 as usual
  • Run stage 5 with ./run.sh --stage 5 --stop_stage 5 --train_config conf/tuning/train_fastspeech2.yaml --teacher_dumpdir data --tts_stats_dir data/stats --write_collected_feats true (reference)
  • Train the rest normally
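
For concreteness, here is a minimal Python sketch of the text/durations generation above. It assumes the TextGrid phone tier has already been parsed into (label, start, end) tuples (for example with the textgrid or tgt packages); the function name is made up for illustration, and fs/n_shift are placeholders that must match your feats_extract settings.

# Sketch: build one line of data/train/text and data/train/durations from an
# MFA phone tier. The helper is hypothetical, not part of ESPnet.
def textgrid_to_text_and_durations(intervals, fs=22050, n_shift=256):
    """intervals: list of (label, start_sec, end_sec) from the TextGrid phone tier."""
    tokens, durations = [], []
    prev_frame = 0
    for label, start, end in intervals:
        # MFA marks silences as empty intervals; keep them (e.g. as "sil")
        # so the alignment between tokens and durations is preserved.
        tokens.append(label if label else "sil")
        # Convert seconds to mel-spectrogram frames; rounding the cumulative
        # boundary keeps the per-line total consistent.
        end_frame = int(round(end * fs / n_shift))
        durations.append(end_frame - prev_frame)
        prev_frame = end_frame
    durations.append(0)  # trailing 0 for the <eos> symbol added internally
    return tokens, durations

# e.g. write the two lines for one utterance
intervals = [("j", 0.00, 0.10), ("", 0.10, 0.18), ("ɛ", 0.18, 0.30)]
tokens, durations = textgrid_to_text_and_durations(intervals)
print("sample1", " ".join(tokens))                # -> line for data/train/text
print("sample1", " ".join(map(str, durations)))   # -> line for data/train/durations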

For inference, first run normal decoding (stage 7) to generate the features:

./run_mfa.sh --stage 7 --stop-stage 7 \
    --tts_exp exp/default_fastspeech2 \
    --inference_config conf/tuning/decode_fastspeech.yaml \
    --inference_model valid.loss.ave_5best.pth \
    --test_sets eval1

Then go to https://github.com/kan-bayashi/ParallelWaveGAN and install it. Download your chosen vocoder checkpoint (.pkl), stats.h5, and config.yml, and run

parallel-wavegan-decode \
    --checkpoint parallel_wavegan/checkpoint-3000000steps.pkl \
    --feats-scp exp/*/decode_fastspeech_valid.loss.ave_5best/eval1/norm/feats.scp \
    --outdir exp/*/parallel_wavegan

I don’t know why, but running stage 7 directly with --vocoder_file makes the generated audio too soft and noisy.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

3 reactions
kan-bayashi commented, Jul 21, 2022

> Use MFA to obtain the .TextGrid files (done)

OK

> Train stages 1–3 as usual (done)

OK

> Create tokens.txt with the MFA phonemes. But since the default behaviour is the one described in https://github.com/espnet/espnet/issues/3340, we need to set cleaner=none and g2p=none during training and edit every data/*/text file to use the phoneme transcription instead of raw text, according to https://github.com/espnet/espnet/issues/3870. Correct?

Right.

> Create the durations file in the format from https://github.com/espnet/espnet/issues/2632#issuecomment-734211026, add 0 to each line according to https://github.com/espnet/espnet/issues/4113, and place it as described in https://github.com/espnet/espnet/issues/2712#issuecomment-732671515. The durations are mel-spectrogram frames. Since MFA TextGrids use seconds, I have to convert them based on fs, n_fft and n_shift. Correct?

Right. But you should add the 0 at the end of each line of the durations file; it is for the <eos> symbol (<eos> is added internally).

> Run stage 5 according to https://github.com/espnet/espnet/issues/2712#issuecomment-732671515, then train the rest normally.

Right.

1 reaction
kan-bayashi commented, Jul 26, 2022

@iamanigeeit Sorry for the late reply.

> For each line in the durations file, adding all the numbers should give me the number of mel-spectrogram frames, right?

Right. It assumes that sum(duration) == #mel frames.

> So I need to add them back into data/*/text and insert their durations into the durations file?

Right. You need to restore them. And please add 0 for <eos> like:

data/*/text:  sample1 j ɛ s , b ɐ t sil
durations:    sample1 3 4 3 7 3 5 3 2 0
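
As a sanity check on the restored files, a sketch like the one below could verify both assumptions at once: one duration per token plus a trailing 0 for <eos>, and sum(duration) == #mel frames. Here n_frames_by_utt is a hypothetical mapping from utterance id to mel-frame count, obtained from however you dumped the features.

# Sketch: validate data/*/text against data/*/durations (hypothetical helper).
def check_alignment_files(text_path, durations_path, n_frames_by_utt):
    with open(text_path, encoding="utf-8") as f:
        tokens = {line.split()[0]: line.split()[1:] for line in f}
    with open(durations_path, encoding="utf-8") as f:
        for line in f:
            utt_id, *durs = line.split()
            durs = [int(d) for d in durs]
            assert durs[-1] == 0, f"{utt_id}: last duration must be 0 for <eos>"
            assert len(durs) == len(tokens[utt_id]) + 1, f"{utt_id}: one duration per token plus <eos>"
            assert sum(durs) == n_frames_by_utt[utt_id], f"{utt_id}: durations must sum to #mel frames"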