Lower-quality voice synthesis with the new version

I trained a speech model and a vocoder using the new versions (espnet==0.10.6 and parallel_wavegan==0.5.4), and the synthesized voices have low quality. When I use the old versions (espnet==0.9.6 and parallel_wavegan==0.4.8) for inference, the outputs have good quality. The new inference code, which does not work for me, looks like this:

from espnet2.bin.tts_inference import Text2Speech
from scipy.io.wavfile import write  # assumed: the (filename, rate, data) order matches SciPy

tts = Text2Speech.from_pretrained(model_file="exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/1000epoch.pth",
                                  vocoder_file="/home/rohola/codes/ryan-tts/models/vocoders/train_nodev_ryanspeech_parallel_wavegan.v1/checkpoint-400000steps.pkl",
                                  speed_control_alpha=1.0,)
wav = tts("Hi I am a good guy. This is a test.")["wav"]

write("x.wav", 22050, wav.view(-1).cpu().numpy())

The same model works fine with the old inference code:

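import torch
from espnet2.bin.tts_inference import Text2Speech
from parallel_wavegan.utils import load_model
from scipy.io.wavfile import write  # assumed: the (filename, rate, data) order matches SciPy

text2speech = Text2Speech(
    train_config="exp/tts_train_raw_phn_tacotron_g2p_en_no_space/config.yaml",
    model_file="exp/tts_train_raw_phn_tacotron_g2p_en_no_space/200epoch.pth",
    device="cuda",
    # Only for Tacotron 2
    threshold=0.5,
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
)

# Skip the built-in spectrogram-to-waveform step; the external vocoder is used instead.
text2speech.spc2wav = None

# config.vocoder_model is the path to the vocoder checkpoint (defined elsewhere).
vocoder = load_model(config.vocoder_model).to("cuda").eval()
vocoder.remove_weight_norm()

with torch.no_grad():
    wav, outs, outs_denorm, probs, att_ws, durations, focus_rate = text2speech(text)
    # Feed the normalized mel output directly to the vocoder.
    wav = vocoder.inference(outs)

write("x.wav", 22050, wav.view(-1).cpu().numpy())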

The new Text2Speech works perfectly fine with model_tag and vocoder_tag. I tried using a predefined model_tag (such as kan-bayashi/ljspeech_tacotron2) with my own vocoder (trained with the new espnet), and it still gives low-volume outputs.

Basic environments:

  • OS information: Ubuntu 18.04 x86_64
  • Python version: 3.8.12 (new code), 3.8.7 (old code)
  • espnet version: as listed above
  • PyTorch version: 1.10.2+cu102 (new code), 1.7.1 (old code)

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
kan-bayashi commented, Feb 26, 2022

Great! Sometimes we use different mean-variance statistics for text2mel and the vocoder, so I decided to process as follows:

  • text -> text2mel -> normalized mel -> denormalize with text2mel stats -> denormalized mel -> normalize with vocoder stats -> normalized mel -> vocoder -> wav (when both stats are available)
  • text -> text2mel -> normalized mel -> denormalize with text2mel stats -> denormalized mel -> vocoder -> wav (when only text2mel stats are available)
  • text -> text2mel -> mel -> vocoder -> wav (when no stats are available)
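
The three cases above can be sketched with plain NumPy. This is only an illustration under stated assumptions, not the actual espnet implementation: the function names, the (mean, scale) tuple layout, and the toy numbers are all hypothetical.

```python
import numpy as np


def denormalize(mel, mean, scale):
    """Undo mean-variance normalization: back to the raw mel scale."""
    return mel * scale + mean


def normalize(mel, mean, scale):
    """Apply mean-variance normalization with the given statistics."""
    return (mel - mean) / scale


def prepare_vocoder_input(mel, text2mel_stats=None, vocoder_stats=None):
    """Convert a text2mel output into the scale the vocoder expects.

    Each stats argument is a hypothetical (mean, scale) pair with one
    value per mel bin, or None when that model has no statistics.
    """
    if text2mel_stats is None:
        # No-stats case: pass the mel through unchanged.
        return mel
    mel = denormalize(mel, *text2mel_stats)
    if vocoder_stats is not None:
        # Both-stats case: re-normalize with the vocoder's own statistics.
        mel = normalize(mel, *vocoder_stats)
    return mel


# Toy example: 3 frames, 2 mel bins, made-up statistics.
mel_norm = np.array([[0.0, 1.0], [1.0, -1.0], [-1.0, 0.0]])
text2mel_stats = (np.array([-4.0, -5.0]), np.array([2.0, 1.0]))
vocoder_stats = (np.array([-3.0, -6.0]), np.array([1.0, 2.0]))

both = prepare_vocoder_input(mel_norm, text2mel_stats, vocoder_stats)
only_text2mel = prepare_vocoder_input(mel_norm, text2mel_stats)
no_stats = prepare_vocoder_input(mel_norm)
```

If the two models were trained with different statistics, feeding the text2mel output to the vocoder without this re-normalization puts the features on the wrong scale, which is consistent with the degraded output described in the question.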

1 reaction
kan-bayashi commented, Feb 25, 2022

Thank you for sharing. It seems to be a normalization issue. Does /home/rohola/codes/ryan-tts/models/vocoders/train_nodev_ryanspeech_parallel_wavegan.v1 include a stats.h5 file?
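
One quick way to answer that question is to check the vocoder directory for the file. The helper below is a hypothetical sketch using only the standard library. It tests for the file's presence and does not inspect its contents. The demo runs against a temporary directory, not the real model path.

```python
from pathlib import Path
import tempfile


def has_stats_file(vocoder_dir, filename="stats.h5"):
    """Return True if the vocoder directory contains the statistics file."""
    return (Path(vocoder_dir) / filename).is_file()


# Demo on a temporary directory standing in for the vocoder directory.
with tempfile.TemporaryDirectory() as tmp:
    before = has_stats_file(tmp)        # directory is empty at this point
    (Path(tmp) / "stats.h5").touch()    # create an empty placeholder file
    after = has_stats_file(tmp)
```

For the real check, pass the vocoder directory named in the comment above as vocoder_dir.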
