Fine-tuning HiFi-GAN with Glow-TTS .npy files
See original GitHub issue

Hello! I'm trying to fine-tune HiFi-GAN with Glow-TTS mel-spectrograms saved as .npy. I generate the .npy files with this code:
```python
import numpy as np
import torch

def TTS(tst_stn, path):
    # Normalize the text and optionally intersperse blank tokens,
    # mirroring the Glow-TTS inference notebook.
    if getattr(hps.data, "add_blank", False):
        text_norm = text_to_sequence(tst_stn.strip(), ['english_cleaners'], cmu_dict)
        text_norm = commons.intersperse(text_norm, len(symbols))
    else:
        tst_stn = " " + tst_stn.strip() + " "
        text_norm = text_to_sequence(tst_stn.strip(), ['english_cleaners'], cmu_dict)
    sequence = np.array(text_norm)[None, :]
    x_tst = torch.from_numpy(sequence).cuda().long()
    x_tst_lengths = torch.tensor([x_tst.shape[1]]).cuda()
    with torch.no_grad():
        noise_scale = 0.667
        length_scale = 1.0
        # Generative branch: the model predicts durations itself.
        (y_gen_tst, *_), *_, (attn_gen, *_) = model(
            x_tst, x_tst_lengths, gen=True,
            noise_scale=noise_scale, length_scale=length_scale)
    np.save("hf/ft_dataset/" + path.split('/')[1] + '.npy',
            y_gen_tst.cpu().numpy())
```
Next, I make a metafile with lines like: `wavs/x.wav|ft_dataset/x.npy`
And I get the following error: `RuntimeError: stack expects each tensor to be equal size, but got [8192] at entry 0 and [6623] at entry 6`
HiFi-GAN does generate wavs from these .npy files in inference mode with Glow-TTS; the error only appears during fine-tuning.
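The error above is consistent with a length mismatch: HiFi-GAN's fine-tuning loader crops paired segments from each wav and its mel, so every `.npy` must have roughly `len(wav) // hop_size` frames, which a mel generated with `gen=True` will not. A minimal sanity-check sketch (the `hop_size=256` default and the one-frame slack are assumptions, not verified settings):

```python
# Check that a mel .npy is time-aligned with its wav. If it is not, HiFi-GAN's
# paired cropping yields audio/mel segments of different sizes, which raises
# the "stack expects each tensor to be equal size" error at batching time.

def check_alignment(n_samples: int, n_frames: int, hop_size: int = 256) -> bool:
    """Return True if the mel frame count matches the audio length."""
    expected = n_samples // hop_size
    return abs(n_frames - expected) <= 1  # allow one frame of padding slack

# Example with synthetic numbers: 8192 samples at hop 256 -> 32 frames.
print(check_alignment(8192, 32))  # aligned
print(check_alignment(8192, 26))  # too short -> mismatched segments
```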
Issue Analytics
- Created: 3 years ago
- Comments: 8 (2 by maintainers)

@4nton-P
Hello. To get mel-spectrograms for fine-tuning, you need to make a small change to the code. If you call the model with `gen=True`, the length of the generated mel-spectrogram may not match the length of the ground-truth audio. In the `gen=False` branch of the forward pass, there is a part that produces the mean and variance from the encoder output expanded by the alignment with the decoder input. If you use these to sample `z` from the Gaussian and feed it to the decoder with `reverse=True`, you will get a mel-spectrogram of the desired length. See lines 313 and 299 in models.py. Also, `noise_scale` can affect quality: the default settings give good results, but experimenting with different `noise_scale` values is worth a try.
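The change described above can be sketched roughly as follows. The names (`z_m`, `z_logs`, `z_mask`, `model.decoder`, the `gen=False` return tuple) follow the Glow-TTS `models.py` conventions, but the exact signatures in a local fork are an assumption:

```python
# Hedged sketch: run the training (gen=False) branch so the encoder statistics
# are expanded by the alignment to the ground-truth length, then sample z from
# that Gaussian and invert the flow decoder to get a length-matched mel.
import torch

@torch.no_grad()
def extract_finetune_mel(model, x, x_lengths, y, y_lengths, noise_scale=0.667):
    # gen=False aligns the text to the real audio, so z_m/z_logs already have
    # the ground-truth number of frames.
    (z, z_m, z_logs, logdet, z_mask), *_ = model(x, x_lengths, y, y_lengths, gen=False)
    # Sample a latent from the predicted Gaussian...
    z_sampled = (z_m + torch.exp(z_logs) * torch.randn_like(z_m) * noise_scale) * z_mask
    # ...and run the decoder backwards (reverse=True) to produce the mel.
    mel, *_ = model.decoder(z_sampled, z_mask, reverse=True)
    return mel
```

This keeps the fine-tuning mels frame-aligned with the wavs, which is exactly what the HiFi-GAN loader needs.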
I am taking mels from FastSpeech2 and feeding them to HiFi-GAN to generate audio, but the resulting audio is noise. I made the shapes compatible, but something is still wrong internally. Please share any ideas I could try.
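When shape-compatible mels still produce noise, the usual culprit is a feature mismatch rather than shape: a different log base, normalization, `fmin`/`fmax`, or hop size between the acoustic model and the vocoder's training features. A rough diagnostic (hedged: the value ranges quoted in the comments are typical conventions, not guarantees for any given checkpoint) is to compare summary statistics of your mels against ones the vocoder was trained on:

```python
# Compare value statistics of two mel sources to expose log-base or
# normalization mismatches between an acoustic model and a vocoder.
import numpy as np

def mel_stats(mel: np.ndarray) -> dict:
    """Summary statistics that usually expose normalization mismatches."""
    return {"min": float(mel.min()), "max": float(mel.max()),
            "mean": float(mel.mean()), "n_mels": mel.shape[0]}

# The official HiFi-GAN recipe uses natural-log mels clamped at 1e-5, so
# values bottom out near log(1e-5) ~= -11.5; FastSpeech2 recipes often
# standardize features instead, which looks like a much narrower range.
print(mel_stats(np.log(np.clip(np.random.rand(80, 120), 1e-5, None))))
```

If the ranges differ, converting (e.g. undoing standardization, or switching log base) before vocoding is worth trying.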