Lower-quality voice synthesis with the new version
See original GitHub issue
I trained a speech model and the vocoder using the new versions (espnet==0.10.6 and parallel_wavegan==0.5.4), and the synthesized voices have low quality. But when I use the old versions (espnet==0.9.6 and parallel_wavegan==0.4.8) for inference, the outputs have good quality. The new inference code looks like this (the version that doesn't work for me):
from scipy.io.wavfile import write
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained(
    model_file="exp/tts_train_conformer_fastspeech2_raw_phn_tacotron_g2p_en_no_space/1000epoch.pth",
    vocoder_file="/home/rohola/codes/ryan-tts/models/vocoders/train_nodev_ryanspeech_parallel_wavegan.v1/checkpoint-400000steps.pkl",
    speed_control_alpha=1.0,
)
wav = tts("Hi I am a good guy. This is a test.")["wav"]
write("x.wav", 22050, wav.view(-1).cpu().numpy())
while the same model works fine with the old inference code:
import torch
from scipy.io.wavfile import write
from espnet2.bin.tts_inference import Text2Speech
from parallel_wavegan.utils import load_model

text2speech = Text2Speech(
    train_config="exp/tts_train_raw_phn_tacotron_g2p_en_no_space/config.yaml",
    model_file="exp/tts_train_raw_phn_tacotron_g2p_en_no_space/200epoch.pth",
    device="cuda",
    # Only for Tacotron 2
    threshold=0.5,
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
)
text2speech.spc2wav = None
vocoder = load_model(config.vocoder_model).to("cuda").eval()
vocoder.remove_weight_norm()

with torch.no_grad():
    wav, outs, outs_denorm, probs, att_ws, durations, focus_rate = text2speech(text)
    wav = vocoder.inference(outs)
write("x.wav", 22050, wav.view(-1).cpu().numpy())
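The maintainers' replies below point at a likely culprit: the text2mel model and the vocoder may have been trained with different mean-variance statistics, so the features fed to the vocoder are scaled wrongly. A minimal numpy sketch of the re-normalization idea, with toy statistics standing in for the real ones from the vocoder's stats.h5 (the helper name and values here are illustrative, not espnet or parallel_wavegan API):

```python
import numpy as np

def renormalize_mel(outs_denorm, vocoder_mean, vocoder_scale):
    """Re-normalize denormalized mel features with the vocoder's own
    training statistics (hypothetical values; in practice these come
    from the vocoder's stats.h5)."""
    return (outs_denorm - vocoder_mean) / vocoder_scale

# Toy example: 4 frames x 80 mel bins.
outs_denorm = np.random.RandomState(0).randn(4, 80) * 2.0 + 1.0

# Stand-ins for the vocoder's per-bin mean/scale statistics.
mean = outs_denorm.mean(axis=0)
scale = outs_denorm.std(axis=0)

outs_for_vocoder = renormalize_mel(outs_denorm, mean, scale)
print(outs_for_vocoder.mean(), outs_for_vocoder.std())  # roughly 0 and 1
```

If the vocoder's statistics differ from the acoustic model's, skipping this step yields audio that is quiet or distorted, which matches the symptom reported here.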
The new Text2Speech works perfectly fine with model_tag and vocoder_tag. I tried a predefined model_tag (like kan-bayashi/ljspeech_tacotron2) with my trained vocoder (using the new espnet) and it still gives low-volume outputs.
Basic environments:
- OS information: Ubuntu 18.04 x86_64
- python version: 3.8.12 (new code), 3.8.7 (old code)
- espnet version: as above
- pytorch version: 1.10.2+cu102 (new code), 1.7.1 (old code)
Issue Analytics
- Created 2 years ago
- Comments: 9 (5 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Great! Sometimes we use different mean-var statistics for text2mel and the vocoder, so I decided to process as follows:
Thank you for sharing. It seems like a normalization issue. Does
/home/rohola/codes/ryan-tts/models/vocoders/train_nodev_ryanspeech_parallel_wavegan.v1
include a stats.h5 file?
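As a quick way to answer the question above, one can check whether the vocoder directory actually ships its training statistics before debugging further. A small illustrative helper (not part of espnet or parallel_wavegan; the file names are the conventional ones produced by parallel_wavegan preprocessing):

```python
from pathlib import Path

def find_stats_file(vocoder_dir):
    """Return the path of the vocoder's training statistics file
    (stats.h5 or stats.npy), or None if the directory lacks one."""
    for name in ("stats.h5", "stats.npy"):
        candidate = Path(vocoder_dir) / name
        if candidate.exists():
            return candidate
    return None

# Example usage with a temporary directory standing in for the real path.
import tempfile
with tempfile.TemporaryDirectory() as d:
    print(find_stats_file(d))       # prints None: no stats file yet
    (Path(d) / "stats.h5").touch()
    print(find_stats_file(d).name)  # prints stats.h5
```

If the file is missing, the Text2Speech wrapper cannot apply the vocoder's normalization, which would explain the quality gap between the two inference paths.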