Fine-tuning HiFi-GAN with Glow-TTS .npy files
See original GitHub issue

Hello! I'm trying to fine-tune HiFi-GAN with Glow-TTS mel-spectrograms saved as .npy. I generate the .npy files with this code:
```python
import numpy as np
import torch

def TTS(tst_stn, path):
    # Normalize the text and optionally intersperse blank tokens,
    # mirroring the Glow-TTS inference notebook.
    if getattr(hps.data, "add_blank", False):
        text_norm = text_to_sequence(tst_stn.strip(), ['english_cleaners'], cmu_dict)
        text_norm = commons.intersperse(text_norm, len(symbols))
    else:
        tst_stn = " " + tst_stn.strip() + " "
        text_norm = text_to_sequence(tst_stn.strip(), ['english_cleaners'], cmu_dict)
    sequence = np.array(text_norm)[None, :]
    x_tst = torch.from_numpy(sequence).cuda().long()
    x_tst_lengths = torch.tensor([x_tst.shape[1]]).cuda()
    with torch.no_grad():
        noise_scale = 0.667
        length_scale = 1.0
        # Generative branch: the model predicts durations itself.
        (y_gen_tst, *_), *_, (attn_gen, *_) = model(
            x_tst, x_tst_lengths, gen=True,
            noise_scale=noise_scale, length_scale=length_scale)
    np.save("hf/ft_dataset/" + path.split('/')[1] + '.npy',
            y_gen_tst.cpu().numpy())
```
Next, I make a metafile with lines like: `wavs/x.wav|ft_dataset/x.npy`
And I get the following error: `RuntimeError: stack expects each tensor to be equal size, but got [8192] at entry 0 and [6623] at entry 6`
HiFi-GAN does generate wavs from these .npy files in inference mode with Glow-TTS; the error only appears during fine-tuning.
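The error above is consistent with a length mismatch: HiFi-GAN's fine-tuning loader crops paired segments from each wav and its mel, so every `.npy` must have roughly `len(wav) // hop_size` frames, which a mel generated with `gen=True` will not. A minimal sanity-check sketch (the `hop_size=256` default and the one-frame slack are assumptions, not verified settings):

```python
# Check that a mel .npy is time-aligned with its wav. If it is not, HiFi-GAN's
# paired cropping yields audio/mel segments of different sizes, which raises
# the "stack expects each tensor to be equal size" error at batching time.

def check_alignment(n_samples: int, n_frames: int, hop_size: int = 256) -> bool:
    """Return True if the mel frame count matches the audio length."""
    expected = n_samples // hop_size
    return abs(n_frames - expected) <= 1  # allow one frame of padding slack

# Example with synthetic numbers: 8192 samples at hop 256 -> 32 frames.
print(check_alignment(8192, 32))  # aligned
print(check_alignment(8192, 26))  # too short -> mismatched segments
```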
Issue Analytics
- Created: 3 years ago
- Comments: 8 (2 by maintainers)

@4nton-P
Hello. To get mel-spectrograms for fine-tuning, you need to make a small change to the code. If you call the model with `gen=True`, the length of the generated mel-spectrogram may not match the length of the ground-truth audio. In the `gen=False` branch of the forward pass, there is a part that produces the mean and variance from the encoder output expanded by the alignment with the decoder input. If you use these to sample `z` from the Gaussian and feed it to the decoder with `reverse=True`, you will get a mel-spectrogram of the desired length. See lines 313 and 299 in models.py. Also, `noise_scale` can affect quality: the default settings give good results, but experimenting with different `noise_scale` values is worth a try.
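The change described above can be sketched roughly as follows. The names (`z_m`, `z_logs`, `z_mask`, `model.decoder`, the `gen=False` return tuple) follow the Glow-TTS `models.py` conventions, but the exact signatures in a local fork are an assumption:

```python
# Hedged sketch: run the training (gen=False) branch so the encoder statistics
# are expanded by the alignment to the ground-truth length, then sample z from
# that Gaussian and invert the flow decoder to get a length-matched mel.
import torch

@torch.no_grad()
def extract_finetune_mel(model, x, x_lengths, y, y_lengths, noise_scale=0.667):
    # gen=False aligns the text to the real audio, so z_m/z_logs already have
    # the ground-truth number of frames.
    (z, z_m, z_logs, logdet, z_mask), *_ = model(x, x_lengths, y, y_lengths, gen=False)
    # Sample a latent from the predicted Gaussian...
    z_sampled = (z_m + torch.exp(z_logs) * torch.randn_like(z_m) * noise_scale) * z_mask
    # ...and run the decoder backwards (reverse=True) to produce the mel.
    mel, *_ = model.decoder(z_sampled, z_mask, reverse=True)
    return mel
```

This keeps the fine-tuning mels frame-aligned with the wavs, which is exactly what the HiFi-GAN loader needs.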
I am taking mels from FastSpeech2 and feeding them to HiFi-GAN to generate audio, but the resulting audio is noise. I made the shapes compatible, but something is still wrong internally. Please share any ideas I could try.
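When shape-compatible mels still produce noise, the usual culprit is a feature mismatch rather than shape: a different log base, normalization, `fmin`/`fmax`, or hop size between the acoustic model and the vocoder's training features. A rough diagnostic (hedged: the value ranges quoted in the comments are typical conventions, not guarantees for any given checkpoint) is to compare summary statistics of your mels against ones the vocoder was trained on:

```python
# Compare value statistics of two mel sources to expose log-base or
# normalization mismatches between an acoustic model and a vocoder.
import numpy as np

def mel_stats(mel: np.ndarray) -> dict:
    """Summary statistics that usually expose normalization mismatches."""
    return {"min": float(mel.min()), "max": float(mel.max()),
            "mean": float(mel.mean()), "n_mels": mel.shape[0]}

# The official HiFi-GAN recipe uses natural-log mels clamped at 1e-5, so
# values bottom out near log(1e-5) ~= -11.5; FastSpeech2 recipes often
# standardize features instead, which looks like a much narrower range.
print(mel_stats(np.log(np.clip(np.random.rand(80, 120), 1e-5, None))))
```

If the ranges differ, converting (e.g. undoing standardization, or switching log base) before vocoding is worth trying.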