
Fine-tuning/Speaker adaptation


First off, this is great work. Can’t wait to play around with the code 👍

In the training instructions, I see that you do have multi-speaker support. Is it possible to “fine-tune” from an existing checkpoint with another dataset using --resume? Has anyone tried this and seen whether the results are good?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 15 (9 by maintainers)

Top GitHub Comments

2 reactions
tiberiu44 commented, Sep 26, 2018

Hi @ZohaibAhmed,

That is a good question. You have two components to consider:

The Vocoder: you can easily add more/different data and adapt a pre-trained model. In fact, I did that initially. The Vocoder was trained on a single speaker (a female voice), and by training for just a couple of epochs I got another speaker (a male voice) working. The adapted model no longer worked as well on the original voice (the results got sloppier). However, I’m now doing multi-speaker training using the SWARA corpus, and the new Vocoder model seems able to synthesize data from new speakers right out of the box (since it is conditioned on the spectrogram and now sees a lot of variance during training).
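Not the repo’s actual training script, but roughly what that kind of checkpoint fine-tuning looks like in PyTorch; the model, checkpoint path and layout, and data below are all placeholder assumptions:

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the real vocoder; the checkpoint path
# and its {"model": state_dict} layout are assumptions for illustration.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 1))
checkpoint = torch.load("vocoder-pretrained.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])

# Placeholder batches of (mel-frame, audio-target) pairs for the new speaker.
new_speaker_batches = [(torch.randn(16, 80), torch.randn(16, 1)) for _ in range(10)]

# Fine-tune for just a couple of epochs, typically with a lower
# learning rate than the original training run.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
model.train()
for _ in range(2):
    for mel, audio in new_speaker_batches:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(mel), audio)
        loss.backward()
        optimizer.step()

torch.save({"model": model.state_dict()}, "vocoder-adapted.pt")
```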

The text encoder: this is a little more difficult to adapt at the moment. The model contains a lookup table of speaker identities, so if you want to add data from a new speaker, you have to replace the “identity” of an existing speaker by modifying the LAB files.
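For context, such a lookup table is essentially a trainable embedding indexed by speaker ID, fixed in size at training time; a minimal sketch (assumed sizes, not the repo’s actual code):

```python
import torch
import torch.nn as nn

NUM_SPEAKERS = 10       # fixed when the model is built
EMBEDDING_DIM = 64      # assumed size, for illustration

# Each speaker ID maps to a trainable vector that conditions the text encoder.
speaker_table = nn.Embedding(NUM_SPEAKERS, EMBEDDING_DIM)

# Because the table size is fixed, a new speaker can only be introduced by
# reusing (relabeling) an existing ID, e.g. marking the new data as speaker 3.
reused_id = torch.tensor([3])
conditioning = speaker_table(reused_id)   # shape: (1, 64)
```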

There is another way to do multi-speaker training, which would actually allow for speaker adaptation on the fly. I think you would only need a couple of minutes of speech to reproduce someone’s voice. The idea is to train a network to discriminate between speakers: given two audio sequences, it has to guess whether the samples belong to the same speaker or not. Then, the latent representations would be used as conditioning vectors for the text encoder, instead of trainable speaker embeddings. This would make speaker adaptation a matter of finding the “right” latent representation, which could possibly work just by running the voice through the speaker identification network.
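A sketch of that idea, assuming a hypothetical speaker encoder with a same/different-speaker training objective (none of these names or shapes come from this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Maps a mel-spectrogram sequence to a fixed-size latent speaker vector."""
    def __init__(self, n_mels=80, hidden=128, latent=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, latent)

    def forward(self, mels):                  # mels: (batch, time, n_mels)
        _, h = self.rnn(mels)
        return self.proj(h[-1])               # (batch, latent)

encoder = SpeakerEncoder()

# Training objective: decide whether two clips come from the same speaker.
def same_speaker_score(clip_a, clip_b):
    return F.cosine_similarity(encoder(clip_a), encoder(clip_b))

# At synthesis time, the latent vector would replace the embedding-table
# lookup: run a few minutes of the target voice through the encoder and
# condition the text encoder on the resulting vector.
target_voice = torch.randn(1, 500, 80)        # placeholder mel input
speaker_vector = encoder(target_voice)        # conditioning vector, shape (1, 64)
```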

I’m currently working on the SparseLSTM implementation, which should make the TTS work in real time. If you are interested in developing speaker adaptation a little further, contributions are welcome 👍

1 reaction
tiberiu44 commented, Nov 27, 2018

It’s real-time on an i7 CPU; an i3 is 4 times slower. Also, I should mention that I obtained nice results using gold-standard data, but end-to-end speech synthesis created a lot of artifacts. This model is not tolerant of the noise added by the text encoder, which predicts the mel-spectrogram. I’m trying something different at the moment and hopefully I will get real-time speech synthesis with high quality.


