Pitch extraction for FastSpeech2
According to the FastSpeech2 paper, page 14:
- We use linear interpolation to fill the unvoiced frame in pitch contour
- We transform the resulting pitch contour to logarithmic scale
- We normalize it to zero mean and unit variance for each utterance, and we have to save the original utterance-level mean and variance for pitch contour reconstruction
- We convert the normalized pitch contour to pitch spectrogram using continuous wavelet transform
For (1) and (2), I can see that the `Dio` pitch extractor does this correctly.
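For reference, here is a minimal sketch of what (1) and (2) amount to, in plain numpy. This is illustrative only and not the actual Dio-based extractor in ESPnet:

```python
import numpy as np

def interpolate_and_log(f0: np.ndarray) -> np.ndarray:
    """(1) Linearly interpolate unvoiced frames (f0 == 0), then (2) take the log.

    Illustrative only; the real extractor differs in details.
    """
    f0 = f0.astype(np.float64).copy()
    voiced = f0 > 0
    if not voiced.any():
        return f0  # fully unvoiced: nothing sensible to interpolate from
    idx = np.arange(len(f0))
    # (1) linear interpolation over the unvoiced frames
    f0[~voiced] = np.interp(idx[~voiced], idx[voiced], f0[voiced])
    # (2) logarithmic scale
    return np.log(f0)
```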
For (3), I need to set `pitch_normalize` in `tts.sh`.
- Why are `pitch_normalize` and `energy_normalize` set to `none` in `tts.sh` line 659? It ignores the yaml config.
- Does the normalizer save the original mean and variance? (A sketch of what I expect (3) to do is below.)
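To be concrete, step (3) as described in the paper would look roughly like the following. This is a sketch of the idea only, not the ESPnet normalizer:

```python
import numpy as np

def normalize_utterance(log_f0: np.ndarray):
    """Zero-mean, unit-variance normalization per utterance.

    Returns the normalized contour plus the statistics that the paper says
    must be kept for pitch contour reconstruction.
    """
    mean = float(log_f0.mean())
    std = float(log_f0.std())
    normed = (log_f0 - mean) / max(std, 1e-8)
    return normed, mean, std

def denormalize_utterance(normed: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Invert the per-utterance normalization using the saved statistics."""
    return normed * std + mean
```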
For (4), I can't find the wavelet transform anywhere. I have no idea how to implement this, because we enforce pitch length == duration length == text length to align the input timesteps, but the transform relies on equally spaced wavelet positions, whereas the durations are variable!
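For the record, the CWT step could be sketched as below. I am assuming a Mexican-hat mother wavelet with dyadic scales (the prior CWT-based prosody work the paper follows uses something like this), and I use PyWavelets rather than anything in ESPnet; treat the wavelet and scale choices as assumptions. Note that this operates on the frame axis, where frames are equally spaced in time, which is exactly the point of tension with token-level lengths raised above:

```python
import numpy as np
import pywt  # PyWavelets

def pitch_spectrogram(normed_log_f0: np.ndarray, num_scales: int = 10) -> np.ndarray:
    """Continuous wavelet transform of the normalized log-F0 contour.

    Returns an array of shape (num_scales, T): one "pitch spectrogram" row
    per wavelet scale. Wavelet type and dyadic scales are assumptions.
    """
    scales = 2.0 ** np.arange(1, num_scales + 1)
    coeffs, _ = pywt.cwt(normed_log_f0, scales, "mexh")
    return coeffs
```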
The paper also says to take the pitch spectrogram as the training target for the pitch predictor, which is optimized with MSE loss.
Currently, we are doing MSE loss on `p_outs` and `ps`, which are the log-F0 pitch values, not the pitch spectrogram.
In the paper, page 4,
To take the pitch contour as input in both training and inference, we quantize pitch F0 (ground-truth/predicted value for train/inference respectively) of each frame to 256 possible values in log-scale and further convert it into pitch embedding vector p and add it to the expanded hidden sequence.
My understanding is: take the log-F0 values, quantize them into {0, 1, ..., 255}, convert each into a one-hot vector, then put it through an embedding layer. This seems different from the `conv1d` we are using.
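To illustrate my reading of that paragraph, something like the following, as opposed to the conv1d path. The log-F0 range, number of bins beyond the paper's 256, and hidden size are assumptions for illustration:

```python
import math

import torch
import torch.nn as nn

num_bins = 256
adim = 384  # hidden size, assumed for illustration
# 255 boundaries over an assumed log-F0 range give 256 buckets
pitch_bins = torch.linspace(math.log(80.0), math.log(750.0), num_bins - 1)
pitch_embed = nn.Embedding(num_bins, adim)

def embed_pitch(log_f0: torch.Tensor) -> torch.Tensor:
    """Quantize log-F0 into {0, ..., 255} and look up an embedding per position."""
    ids = torch.bucketize(log_f0, pitch_bins)
    return pitch_embed(ids)
```

An `nn.Embedding` lookup is equivalent to a one-hot vector followed by a linear layer, so no explicit one-hot is needed.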
If I train without (3) and (4), I get an unstable pitch loss:
Top GitHub Comments
For question (4), we use a FastPitch-style strategy rather than the FastSpeech2 one:
https://github.com/espnet/espnet/blob/92ea573f44e252644860b6b7906916218e4ab78b/espnet2/tts/fastspeech2/fastspeech2.py#L36-L44
The following discussion may help you:
https://github.com/espnet/espnet/issues/2019#issuecomment-657169925
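For anyone landing here later: the FastPitch-style strategy averages the frame-level pitch over each token's duration, so the pitch sequence ends up with the same length as the text. A rough sketch of that averaging, not the actual code linked above:

```python
import torch

def average_by_duration(frame_pitch: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Average frame-level pitch within each token's duration (FastPitch-style).

    frame_pitch: (T_frames,), durations: (T_text,) with durations.sum() == T_frames.
    Returns a (T_text,) token-level pitch sequence.
    """
    out = frame_pitch.new_zeros(len(durations))
    start = 0
    for i, d in enumerate(durations.tolist()):
        if d > 0:
            out[i] = frame_pitch[start:start + d].mean()
        start += d
    return out
```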
You are right, I am using LJSpeech and the pitch loss looks OK after longer training. I can't see the epoch loss well because the x-axis points are too close together.