
Fine-tuning/Speaker adaptation


First off, this is great work. Can’t wait to play around with the code 👍

In the training instructions, I see that you do have multi-speaker support. Is it possible to “fine-tune” from an existing checkpoint with another dataset using --resume? Has anyone tried this and seen whether the results are good?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 15 (9 by maintainers)

Top GitHub Comments

2 reactions
tiberiu44 commented, Sep 26, 2018

Hi @ZohaibAhmed,

That is a good question. You have two components to consider:

The Vocoder: you can easily add more/different data and adapt a pre-trained model. In fact, I did that initially. The Vocoder was trained on a single speaker (a female voice), and by training for just a couple of epochs I got another speaker (a male voice) working. The adapted model no longer worked as well on the original voice (the results got sloppier). However, I’m now doing multi-speaker training using the SWARA corpus, and the new Vocoder model seems able to synthesize data from new speakers right out of the box (since it is conditioned on the spectrogram and now sees a lot of variance during training).
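Not the repo’s actual training script, but roughly what that kind of checkpoint fine-tuning looks like in PyTorch; the model, checkpoint path and layout, and data below are all placeholder assumptions:

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the real vocoder; the checkpoint path
# and its {"model": state_dict} layout are assumptions for illustration.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 1))
checkpoint = torch.load("vocoder-pretrained.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])

# Placeholder batches of (mel-frame, audio-target) pairs for the new speaker.
new_speaker_batches = [(torch.randn(16, 80), torch.randn(16, 1)) for _ in range(10)]

# Fine-tune for just a couple of epochs, typically with a lower
# learning rate than the original training run.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
model.train()
for _ in range(2):
    for mel, audio in new_speaker_batches:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(mel), audio)
        loss.backward()
        optimizer.step()

torch.save({"model": model.state_dict()}, "vocoder-adapted.pt")
```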

The text encoder: this is a little more difficult to adapt at the moment. The model contains a lookup table of speaker identities, so if you want to add data from a new speaker, you have to replace the “identity” of an existing speaker by modifying the LAB files.
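For context, such a lookup table is essentially a trainable embedding indexed by speaker ID, fixed in size at training time; a minimal sketch (assumed sizes, not the repo’s actual code):

```python
import torch
import torch.nn as nn

NUM_SPEAKERS = 10       # fixed when the model is built
EMBEDDING_DIM = 64      # assumed size, for illustration

# Each speaker ID maps to a trainable vector that conditions the text encoder.
speaker_table = nn.Embedding(NUM_SPEAKERS, EMBEDDING_DIM)

# Because the table size is fixed, a new speaker can only be introduced by
# reusing (relabeling) an existing ID, e.g. marking the new data as speaker 3.
reused_id = torch.tensor([3])
conditioning = speaker_table(reused_id)   # shape: (1, 64)
```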

There is another way to do multi-speaker training, which would actually allow for speaker adaptation on the fly. I think you would only need a couple of minutes of speech to reproduce someone’s voice. The idea is to train a network to discriminate between speakers: given two audio sequences, it has to guess whether the samples belong to the same speaker or not. Then, the latent representations would be used as conditioning vectors for the text encoder, instead of trainable speaker embeddings. This would make speaker adaptation a matter of finding the “right” latent representation, which could possibly work just by running the voice through the speaker identification network.
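A sketch of that idea, assuming a hypothetical speaker encoder with a same/different-speaker training objective (none of these names or shapes come from this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Maps a mel-spectrogram sequence to a fixed-size latent speaker vector."""
    def __init__(self, n_mels=80, hidden=128, latent=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, latent)

    def forward(self, mels):                  # mels: (batch, time, n_mels)
        _, h = self.rnn(mels)
        return self.proj(h[-1])               # (batch, latent)

encoder = SpeakerEncoder()

# Training objective: decide whether two clips come from the same speaker.
def same_speaker_score(clip_a, clip_b):
    return F.cosine_similarity(encoder(clip_a), encoder(clip_b))

# At synthesis time, the latent vector would replace the embedding-table
# lookup: run a few minutes of the target voice through the encoder and
# condition the text encoder on the resulting vector.
target_voice = torch.randn(1, 500, 80)        # placeholder mel input
speaker_vector = encoder(target_voice)        # conditioning vector, shape (1, 64)
```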

I’m currently working on the SparseLSTM implementation, which should make the TTS work in real time. If you are interested in developing speaker adaptation a little further, contributions are welcome 👍

1 reaction
tiberiu44 commented, Nov 27, 2018

It’s real-time on an i7 CPU; an i3 is 4 times slower. Also, I should mention that I obtained nice results using gold-standard data, but end-to-end speech synthesis created a lot of artifacts. This model is not tolerant of the noise added by the text encoder, which predicts the mel-spectrogram. I’m trying something different at the moment and hopefully I will get real-time speech synthesis with high quality.


