[Bug] Korean Zero-shot training
Describe the bug
Dear author, thank you for your nice work. I tried applying Coqui TTS to Korean. I trained it on the single-speaker KSS dataset (12,844 samples), and it works: synthesis with GlowTTS and MB-MelGAN gives acceptable results. The audio result is attached (coquiTTS.zip). However, when I tried to apply it to multi-speaker data (AI Hub) with more than 1,500 speakers but only 30,000 samples, I got an error. Could you help me figure out what the problem is, and suggest a better way to do zero-shot TTS for Korean? The error image is attached.
To Reproduce
Expected behavior
No response
Logs
No response
Environment
- TTS develop version (running from the TTS folder, not installed as a package)
- CUDA 11.7
- Python 3.7
Additional context
No response
Issue Analytics
- State:
- Created 9 months ago
- Comments: 13 (7 by maintainers)
Top GitHub Comments
Hi @hathubkhn, that is right: YourTTS can produce a new voice without the model ever being trained on the target voice. During training, the YourTTS model is conditioned on speaker embeddings extracted by a speaker encoder (trained on thousands of speakers). The speaker encoder generalizes well, producing good embeddings for unseen speakers, so the YourTTS model conditioned on these embeddings can generate new voices as well. If you are interested in understanding how this works, @WeberJulian made a YouTube video about the YourTTS model that you can watch here. In addition, see my talk at NVIDIA's AI Summit, which you can access here.
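The conditioning described above can be sketched numerically: the speaker encoder maps a reference utterance to a fixed-size embedding (a d-vector), the synthesizer is conditioned on that vector, and speaker similarity between a reference and a synthesized clone is typically judged by cosine similarity of their embeddings. A minimal sketch follows; the embedding values and the `cosine_similarity` helper are illustrative stand-ins, not Coqui TTS code (real embeddings come from the trained speaker encoder and have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical d-vectors: one extracted from a reference recording of an
# unseen speaker, one from speech synthesized while conditioning on it.
ref_embedding = [0.2, 0.8, 0.1, 0.5]
clone_embedding = [0.25, 0.75, 0.12, 0.48]

# A value close to 1.0 means the synthesized voice matches the reference.
print(round(cosine_similarity(ref_embedding, clone_embedding), 3))  # → 0.998
```

In practice the same metric (Speaker Encoder Cosine Similarity, SECS) is what the YourTTS paper reports to quantify zero-shot cloning quality, so it is a useful sanity check when evaluating a new language such as Korean.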
Hello Edresson, I am working on Korean, so could I use an approach like: