Pitch extraction for FastSpeech2
According to the FastSpeech2 paper, page 14:
- We use linear interpolation to fill the unvoiced frame in pitch contour
- We transform the resulting pitch contour to logarithmic scale
- We normalize it to zero mean and unit variance for each utterance, and we have to save the original utterance-level mean and variance for pitch contour reconstruction
- We convert the normalized pitch contour to pitch spectrogram using continuous wavelet transform
For (1) and (2), I can see that the `Dio` pitch extractor does this correctly.
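For reference, here is a minimal sketch of what (1) and (2) amount to, in plain numpy. This is illustrative only and not the actual Dio-based extractor in ESPnet:

```python
import numpy as np

def interpolate_and_log(f0: np.ndarray) -> np.ndarray:
    """(1) Linearly interpolate unvoiced frames (f0 == 0), then (2) take the log.

    Illustrative only; the real extractor differs in details.
    """
    f0 = f0.astype(np.float64).copy()
    voiced = f0 > 0
    if not voiced.any():
        return f0  # fully unvoiced: nothing sensible to interpolate from
    idx = np.arange(len(f0))
    # (1) linear interpolation over the unvoiced frames
    f0[~voiced] = np.interp(idx[~voiced], idx[voiced], f0[voiced])
    # (2) logarithmic scale
    return np.log(f0)
```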
For (3), I need to set `pitch_normalize` in `tts.sh`.
- Why are `pitch_normalize` and `energy_normalize` set to `none` in `tts.sh` line 659? It ignores the yaml config.
- Does the normalizer save the original mean and variance? (A sketch of what I expect (3) to do is below.)
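To be concrete, step (3) as described in the paper would look roughly like the following. This is a sketch of the idea only, not the ESPnet normalizer:

```python
import numpy as np

def normalize_utterance(log_f0: np.ndarray):
    """Zero-mean, unit-variance normalization per utterance.

    Returns the normalized contour plus the statistics that the paper says
    must be kept for pitch contour reconstruction.
    """
    mean = float(log_f0.mean())
    std = float(log_f0.std())
    normed = (log_f0 - mean) / max(std, 1e-8)
    return normed, mean, std

def denormalize_utterance(normed: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Invert the per-utterance normalization using the saved statistics."""
    return normed * std + mean
```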
For (4), I can't find the wavelet transform anywhere. I have no idea how to implement this, because we enforce pitch length == duration length == text length to align the input timesteps, but the transform relies on equally spaced wavelet positions, whereas the durations are variable!
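For the record, the CWT step could be sketched as below. I am assuming a Mexican-hat mother wavelet with dyadic scales (the prior CWT-based prosody work the paper follows uses something like this), and I use PyWavelets rather than anything in ESPnet; treat the wavelet and scale choices as assumptions. Note that this operates on the frame axis, where frames are equally spaced in time, which is exactly the point of tension with token-level lengths raised above:

```python
import numpy as np
import pywt  # PyWavelets

def pitch_spectrogram(normed_log_f0: np.ndarray, num_scales: int = 10) -> np.ndarray:
    """Continuous wavelet transform of the normalized log-F0 contour.

    Returns an array of shape (num_scales, T): one "pitch spectrogram" row
    per wavelet scale. Wavelet type and dyadic scales are assumptions.
    """
    scales = 2.0 ** np.arange(1, num_scales + 1)
    coeffs, _ = pywt.cwt(normed_log_f0, scales, "mexh")
    return coeffs
```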
The paper also says to take the pitch spectrogram as the training target for the pitch predictor, which is optimized with MSE loss.
Currently, we are doing MSE loss on `p_outs` and `ps`, which are the log-F0 pitch values, not the pitch spectrogram.
In the paper, page 4,
To take the pitch contour as input in both training and inference, we quantize pitch F0 (ground-truth/predicted value for train/inference respectively) of each frame to 256 possible values in log-scale and further convert it into pitch embedding vector p and add it to the expanded hidden sequence.
My understanding is: take the log-F0 values, quantize them into {0, 1, ..., 255}, convert each into a one-hot vector, then put it through an embedding layer. This seems different from the `conv1d` we are using.
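To illustrate my reading of that paragraph, something like the following, as opposed to the conv1d path. The log-F0 range, number of bins beyond the paper's 256, and hidden size are assumptions for illustration:

```python
import math

import torch
import torch.nn as nn

num_bins = 256
adim = 384  # hidden size, assumed for illustration
# 255 boundaries over an assumed log-F0 range give 256 buckets
pitch_bins = torch.linspace(math.log(80.0), math.log(750.0), num_bins - 1)
pitch_embed = nn.Embedding(num_bins, adim)

def embed_pitch(log_f0: torch.Tensor) -> torch.Tensor:
    """Quantize log-F0 into {0, ..., 255} and look up an embedding per position."""
    ids = torch.bucketize(log_f0, pitch_bins)
    return pitch_embed(ids)
```

An `nn.Embedding` lookup is equivalent to a one-hot vector followed by a linear layer, so no explicit one-hot is needed.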
If I train without (3) and (4), I get an unstable pitch loss:
Top GitHub Comments
For question (4), we use a FastPitch-style strategy rather than the FastSpeech2 one:
https://github.com/espnet/espnet/blob/92ea573f44e252644860b6b7906916218e4ab78b/espnet2/tts/fastspeech2/fastspeech2.py#L36-L44
The following discussion may help you:
https://github.com/espnet/espnet/issues/2019#issuecomment-657169925
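For anyone landing here later: the FastPitch-style strategy averages the frame-level pitch over each token's duration, so the pitch sequence ends up with the same length as the text. A rough sketch of that averaging, not the actual code linked above:

```python
import torch

def average_by_duration(frame_pitch: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Average frame-level pitch within each token's duration (FastPitch-style).

    frame_pitch: (T_frames,), durations: (T_text,) with durations.sum() == T_frames.
    Returns a (T_text,) token-level pitch sequence.
    """
    out = frame_pitch.new_zeros(len(durations))
    start = 0
    for i, d in enumerate(durations.tolist()):
        if d > 0:
            out[i] = frame_pitch[start:start + d].mean()
        start += d
    return out
```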
You are right, I am using LJSpeech and the pitch loss looks OK after longer training. I can't see the epoch loss well because the x-axis points are too close together.