Errors when trying to train SC-GlowTTS

Describe the bug

I am trying to train an SC-GlowTTS model. I downloaded the config from the latest release and tried to launch TTS/bin/train_glow_tts.py. However, I hit a series of errors about values missing from the config: first stats_path, then use_noise_augment. Now I get AssertionError: 22050 vs 48000, even though the config states “wav sample-rate. If different than the original data, it is resampled”. What is the proper way to train SC-GlowTTS? 😃
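The assertion message reads as if the sample rate in the config (22050) does not match the rate stored in the audio files (48000). A minimal sketch for confirming that before training, assuming the dataset is stored as PCM .wav files: it uses only the Python standard library, and the function name and dataset path are hypothetical, not part of TTS.

```python
import wave
from pathlib import Path


def find_sample_rate_mismatches(wav_dir, expected_rate):
    """Return {path: actual_rate} for every .wav file whose header rate differs."""
    mismatches = {}
    for wav_path in sorted(Path(wav_dir).rglob("*.wav")):
        # Only the header is read here, so this stays fast on large datasets.
        with wave.open(str(wav_path), "rb") as wav_file:
            actual_rate = wav_file.getframerate()
        if actual_rate != expected_rate:
            mismatches[str(wav_path)] = actual_rate
    return mismatches


# Hypothetical usage: replace the path with your dataset location.
# print(find_sample_rate_mismatches("/path/to/VCTK", 22050))
```

The standard-library wave module only handles uncompressed PCM WAV, so files in other formats (for example FLAC) would need a different reader.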

To Reproduce

Steps to reproduce the behavior:

1. Download and unzip the SC-GlowTTS config from the v0.0.13 release (https://github.com/coqui-ai/TTS/releases/download/v0.0.12/tts_models--en--vctk--sc-glowtts-transformer.zip)
2. Download and unzip the VCTK dataset, e.g. from here (link from the SC-GlowTTS repo)
3. Replace the dataset path in the config with your own
4. Download and install TTS: git clone https://github.com/coqui-ai/TTS && cd TTS && pip install -e .
5. Run with your config path from the TTS directory: python TTS/bin/train_glow_tts.py --config_path /path/to/config/
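Since the errors surface one missing config value at a time, a pre-flight check can list them all before the training script is launched. The sketch below is an assumption-laden illustration: the helper name is hypothetical, and the placement of stats_path under an "audio" section is a guess about the config layout that should be adjusted to the actual file.

```python
import json


def missing_config_keys(config, required_keys):
    """Return the dotted keys from required_keys that are absent from the nested dict."""
    missing = []
    for dotted_key in required_keys:
        node = config
        for part in dotted_key.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted_key)
                break
            node = node[part]
    return missing


# Hypothetical config fragment; the nesting under "audio" is an assumption.
example_config = json.loads('{"audio": {"sample_rate": 22050}}')
print(missing_config_keys(
    example_config,
    ["audio.sample_rate", "audio.stats_path", "use_noise_augment"],
))
```

Note that json.loads rejects files containing comments, so a config with inline comments would need them stripped first.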

Expected behavior

The model trains without errors.

Environment (please complete the following information):

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
PyTorch or TensorFlow version (use command below): PyTorch 1.8.1
Python version: Python 3.7.10
CUDA/cuDNN version: CUDA 10.2 cuDNN 7.6.5

Issue Analytics

Created 2 years ago
Comments: 15 (6 by maintainers)

Top GitHub Comments

Edresson commented, May 17, 2021

@loganhart420 I recommend that you use the same speaker encoder used in the paper and available here (trained for 330k steps).

In SC-GlowTTS, the quality of the speaker encoder is fundamental because the model doesn’t receive any other information about the speaker.

As your batch size is smaller, you should train for more steps. In addition, in the article we trained the model for 150k steps on VCTK, which is much smaller and has only 108 speakers. Since you are training on a larger dataset, you need even more steps.
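One rough way to read the batch-size advice is to keep the total number of training samples processed constant. The sketch below is a back-of-the-envelope heuristic, not a rule from the paper: the 150k steps figure comes from the comment above, while both batch sizes are hypothetical placeholders.

```python
import math


def steps_for_same_samples(reference_steps, reference_batch_size, your_batch_size):
    """Steps needed at your_batch_size to process as many samples as the reference run."""
    total_samples = reference_steps * reference_batch_size
    return math.ceil(total_samples / your_batch_size)


# 150_000 is from the comment above; 64 and 16 are hypothetical batch sizes.
print(steps_for_same_samples(150_000, 64, 16))
```

This only accounts for the batch-size difference; a larger dataset would call for more steps on top of that.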

erogol commented, May 15, 2021

Maybe @Edresson can help as the one who trained the models.

My take is that LibriTTS is a harder dataset, so it is more difficult to reach the same quality on it.