
Pitch extraction for FastSpeech2


According to the FastSpeech2 paper page 14,

  1. We use linear interpolation to fill the unvoiced frame in pitch contour
  2. We transform the resulting pitch contour to logarithmic scale
  3. We normalize it to zero mean and unit variance for each utterance, and we have to save the original utterance-level mean and variance for pitch contour reconstruction
  4. We convert the normalized pitch contour to pitch spectrogram using continuous wavelet transform

For (1) and (2), I can see that the Dio pitch extractor does this correctly.
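
For reference, steps (1) and (2) amount to filling zero-valued (unvoiced) frames by linear interpolation between the surrounding voiced frames and then taking the log. A minimal numpy sketch (a hypothetical helper, not the actual Dio/ESPnet code):

```python
import numpy as np

def interpolate_and_log(f0):
    """Fill unvoiced frames (f0 == 0) by linear interpolation over the
    voiced frames, then convert the contour to log scale."""
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0
    if not voiced.any():
        return np.zeros_like(f0)
    idx = np.arange(len(f0))
    # np.interp clamps leading/trailing unvoiced frames to the edge values
    filled = np.interp(idx, idx[voiced], f0[voiced])
    return np.log(filled)

contour = interpolate_and_log([0.0, 100.0, 0.0, 0.0, 200.0, 0.0])
```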

For (3), I need to set pitch_normalize in tts.sh.

  • Why are pitch_normalize and energy_normalize set to none in tts.sh line 659? This ignores the yaml config.
  • Does the normalizer save the original mean and variance?
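
To illustrate what (3) requires: the normalization must return not just the normalized contour but also the per-utterance statistics, or the original pitch cannot be reconstructed. A sketch under the assumption of utterance-level statistics (note that a corpus-level normalizer would compute the mean and variance over all utterances instead, which is exactly why the question above matters):

```python
import numpy as np

def normalize_utterance(log_f0):
    """Normalize a log-F0 contour to zero mean and unit variance,
    returning the stats needed to invert the transform."""
    log_f0 = np.asarray(log_f0, dtype=np.float64)
    mean = float(log_f0.mean())
    std = float(log_f0.std())
    norm = (log_f0 - mean) / max(std, 1e-8)  # guard against flat contours
    return norm, {"mean": mean, "std": std}

def denormalize(norm, stats):
    """Reconstruct the original log-F0 contour from saved stats."""
    return np.asarray(norm) * stats["std"] + stats["mean"]

norm, stats = normalize_utterance([4.6, 5.0, 5.3, 4.8])
recon = denormalize(norm, stats)
```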

For (4), I can’t find the wavelet transform anywhere. I have no idea how to implement this, because we enforce pitch length == duration length == text length to align the input timesteps, but the transform relies on equally spaced wavelet positions, whereas the durations are variable!
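
Setting the alignment question aside, the transform itself in (4) is a continuous wavelet transform of the normalized log-F0 contour: one output row per scale, one column per frame. A minimal numpy sketch with a Mexican-hat (Ricker) wavelet; the choice of 10 dyadic scales is an assumption here, not something stated in this issue:

```python
import numpy as np

def ricker(points, a):
    """Mexican-hat (Ricker) wavelet of width parameter a."""
    x = np.arange(points) - (points - 1) / 2.0
    amp = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return amp * (1 - (x / a) ** 2) * np.exp(-(x ** 2) / (2 * a ** 2))

def cwt_spectrogram(signal, scales):
    """Continuous wavelet transform: convolve the contour with the
    wavelet at each scale; rows = scales, columns = frames."""
    signal = np.asarray(signal, dtype=np.float64)
    out = np.empty((len(scales), len(signal)))
    for i, a in enumerate(scales):
        wavelet = ricker(min(10 * int(a), len(signal)), a)
        out[i] = np.convolve(signal, wavelet, mode="same")
    return out

scales = [2.0 ** (j + 1) for j in range(10)]  # assumed dyadic scales
spec = cwt_spectrogram(np.sin(np.linspace(0, 8 * np.pi, 256)), scales)
```

This operates on the frame-level contour, where frames are equally spaced; how to reconcile that with phoneme-level averaging is exactly the open question above.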

The paper also says to take the pitch spectrogram as the training target for the pitch predictor, which is optimized with MSE loss.

Currently, we are doing MSE loss on p_outs and ps, which are the log-F0 pitch values, not the pitch spectrogram.

In the paper, page 4,

To take the pitch contour as input in both training and inference, we quantize pitch F0 (ground-truth/predicted value for train/inference respectively) of each frame to 256 possible values in log-scale and further convert it into pitch embedding vector p and add it to the expanded hidden sequence.

My understanding is: take the log-F0 values, quantize them into {0, 1, ..., 255}, and look the indices up in an embedding layer (equivalent to multiplying a one-hot vector by the embedding matrix). This seems different from the conv1d we are using.
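
A numpy sketch of that reading of the paper; the F0 range (80–400 Hz) and the tiny embedding dimension are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# 256 bins spaced uniformly in log scale over an assumed F0 range in Hz
f0_min, f0_max = 80.0, 400.0
edges = np.linspace(np.log(f0_min), np.log(f0_max), 257)[1:-1]  # 255 interior edges

embedding = rng.normal(size=(256, 4))  # toy embedding table, dim 4 for brevity

def pitch_embedding(f0_hz):
    """Quantize each frame's F0 into one of 256 log-scale bins and look up
    the corresponding embedding row (index lookup == one-hot @ embedding)."""
    ids = np.digitize(np.log(np.asarray(f0_hz, dtype=np.float64)), edges)
    return ids, embedding[ids]

ids, emb = pitch_embedding([80.0, 150.0, 400.0])
```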

If I train without (3) and (4), I get unstable pitch loss:

[Figure: pitch_loss training curve]

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
kan-bayashi commented, Jul 29, 2022

0 reactions
iamanigeeit commented, Aug 8, 2022

You are right, I am using LJSpeech, and the pitch loss looks OK after longer training. I can’t see the epoch loss well because the X marks are too close together.
