Lower-quality voice synthesis with the new version


I trained a speech model and a vocoder with the new versions (espnet==0.10.6 and parallel_wavegan==0.5.4), and the synthesized speech has low quality. When I use the old versions (espnet==0.9.6 and parallel_wavegan==0.4.8) for inference, the output quality is good. The new inference code, which does not work for me, looks like this:

from espnet2.bin.tts_inference import Text2Speech
from scipy.io.wavfile import write  # assumed source of write(filename, rate, data)

tts = Text2Speech.from_pretrained(model_file="exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/1000epoch.pth",
                                  vocoder_file="/home/rohola/codes/ryan-tts/models/vocoders/train_nodev_ryanspeech_parallel_wavegan.v1/checkpoint-400000steps.pkl",
                                  speed_control_alpha=1.0,)
wav = tts("Hi I am a good guy. This is a test.")["wav"]

write("x.wav", 22050, wav.view(-1).cpu().numpy())

The same model works fine with the old inference code:

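import torch
from espnet2.bin.tts_inference import Text2Speech
from parallel_wavegan.utils import load_model
from scipy.io.wavfile import write  # assumed source of write(filename, rate, data)

# `config` and `text` are defined elsewhere (not shown in the original post).
text2speech = Text2Speech(
    train_config="exp/tts_train_raw_phn_tacotron_g2p_en_no_space/config.yaml",
    model_file="exp/tts_train_raw_phn_tacotron_g2p_en_no_space/200epoch.pth",
    device="cuda",
    # Only for Tacotron 2
    threshold=0.5,
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
)

# Skip the built-in spectrogram-to-waveform step; the external vocoder is used instead.
text2speech.spc2wav = None

vocoder = load_model(config.vocoder_model).to("cuda").eval()
vocoder.remove_weight_norm()

with torch.no_grad():
    wav, outs, outs_denorm, probs, att_ws, durations, focus_rate = text2speech(text)
    # Feed the model's (normalized) output features to the external vocoder.
    wav = vocoder.inference(outs)

write("x.wav", 22050, wav.view(-1).cpu().numpy())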

The new Text2Speech works perfectly fine with model_tag and vocoder_tag. I also tried a predefined model_tag (such as kan-bayashi/ljspeech_tacotron2) with my own vocoder (trained using the new espnet), and it still gives low-volume outputs.

Basic environments:

OS information: Ubuntu 18.04 x86_64
python version: new code python 3.8.12 and old code 3.8.7
espnet version: above
pytorch version: new code 1.10.2+cu102, old code 1.7.1

Issue Analytics

Created 2 years ago
Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction

kan-bayashi commented, Feb 26, 2022

Great! Sometimes we use different mean-variance statistics for text2mel and the vocoder, so I decided to process the features as follows:

text -> text2mel -> mel norm -> denorm by text2mel stats -> mel denorm -> norm by vocoder stats -> mel norm -> vocoder -> wav (with both stats case)
text -> text2mel -> mel norm -> denorm by text2mel stats -> mel denorm -> vocoder -> wav (with only text2mel stats case)
text -> text2mel -> mel -> vocoder -> wav (without stats case)
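
The three cases above can be sketched roughly as follows. This is only an illustration of the routing, not the actual ESPnet implementation: the function name `prepare_mel_for_vocoder` and the `(mean, scale)` representation of the stats are assumptions made for this example.

```python
def prepare_mel_for_vocoder(mel, text2mel_stats=None, vocoder_stats=None):
    """Route text2mel output to the vocoder following the three listed cases.

    `mel` is the text2mel output. Each `*_stats` argument is a hypothetical
    (mean, scale) pair. Only arithmetic operators are used, so the function
    accepts floats, NumPy arrays, or torch tensors alike.
    """
    if text2mel_stats is None:
        # "Without stats" case: pass the features through unchanged.
        return mel

    # Denormalize with the text2mel statistics.
    t_mean, t_scale = text2mel_stats
    mel_denorm = mel * t_scale + t_mean

    if vocoder_stats is None:
        # "Only text2mel stats" case: the vocoder receives denormalized features.
        return mel_denorm

    # "Both stats" case: renormalize with the vocoder's own statistics.
    v_mean, v_scale = vocoder_stats
    return (mel_denorm - v_mean) / v_scale


# Toy scalar demo; real features would be (frames, n_mels) arrays.
features = prepare_mel_for_vocoder(1.0, text2mel_stats=(2.0, 4.0), vocoder_stats=(1.0, 5.0))
```

If the vocoder was trained on features normalized with its own statistics, feeding it features on a different scale is one plausible cause of the degraded audio discussed in this thread.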

1 reaction

kan-bayashi commented, Feb 25, 2022

Thank you for sharing. It seems to be a normalization issue. Does /home/rohola/codes/ryan-tts/models/vocoders/train_nodev_ryanspeech_parallel_wavegan.v1 include a stats.h5 file?
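
A quick way to answer that question is to look for the file next to the vocoder checkpoint. The helper below is a minimal sketch using only the standard library; the name `find_vocoder_stats` is hypothetical, and it assumes the stats file sits in the same directory as the checkpoint.

```python
from pathlib import Path


def find_vocoder_stats(checkpoint_path, stats_name="stats.h5"):
    """Return the path of the stats file beside the checkpoint, or None.

    Assumption: the statistics file lives in the same directory as the
    vocoder checkpoint, as the question above suggests.
    """
    stats_path = Path(checkpoint_path).parent / stats_name
    return stats_path if stats_path.is_file() else None
```

For example, calling `find_vocoder_stats` on the `checkpoint-400000steps.pkl` path from the original post returns `None` when the directory has no stats.h5.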