
[Usage] VITS finetuning


Hi,

First of all, thanks for this incredible work; I hope to be able to contribute soon!

I’m trying to fine-tune VITS using the espeak VCTK 44100 version, but after 245 epochs the generated audio has serious “early stopping” issues: when I try to generate audio with the resulting checkpoint, it stops more or less after the first phoneme. I only use sentences that were tested with the pretrained model.

I extended the VCTK dataset with more speakers from my private dataset (n = 239, including the original VCTK speakers), but I can’t figure out whether the problem is in my dataset or in my ESPnet train config. I probably have some issues with my .lab files; could they be the source of my bad training?

Here is the command used for training at stage 6:

./tts.sh \
    --lang en --feats_type raw --fs 48000 \
    --n_fft 2048 --n_shift 300 --win_length 1200 \
    --token_type phn --cleaner tacotron --g2p g2p_en_no_space \
    --train_config conf/train.yaml --inference_config conf/decode.yaml \
    --train_set tr_no_dev --valid_set dev --test_sets 'dev eval1' \
    --srctexts data/tr_no_dev/text --audio_format wav \
    --train_set tr_no_dev_phn --valid_set dev_phn --test_sets 'dev_phn eval1_phn' \
    --srctexts data/tr_no_dev_phn/text --g2p none --cleaner none \
    --stage 6 --use_sid true --min_wav_duration 0.38 --ngpu 1 \
    --n_shift 512 --dumpdir dump/44k --expdir exp/44k \
    --tts_task gan_tts --feats_extract linear_spectrogram --feats_normalize none \
    --train_config ./conf/train.yaml --inference_model train.total_count.ave.pth \
    --train_args '--init_param downloads/c958873c3aa8d54124819460626cf9d7/exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_espeak_ng_english_us_vits/train.total_count.ave_5best.pth:tts:tts:tts.generator.global_emb.weight' \
    --tag finetune_vits_espeak --stage 6

Is this command correct? I’m really not sure about the --init_param usage, but I tried to follow the documentation (mostly the jvs recipe). I saw in another issue that I should get some intelligible results after a few epochs with a fine-tuning setup. Is that correct?
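For reference, ESPnet2 describes --init_param as <file_path>:<src_key>:<dst_key>:<exclude_keys>, so the string above loads the tts submodule of the pretrained checkpoint into the new model’s tts submodule while excluding tts.generator.global_emb.weight, the speaker-embedding table whose first dimension no longer matches once speakers are added. Below is a minimal sketch for inspecting what that checkpoint contains before pointing --init_param at it; the path is the one from the command above, and the assumption that the averaged .pth file is a plain state_dict may need checking against your ESPnet version.

# Inspect the pretrained VITS checkpoint referenced by --init_param.
import torch

ckpt_path = (
    "downloads/c958873c3aa8d54124819460626cf9d7/exp/"
    "tts_train_full_band_multi_spk_vits_raw_phn_tacotron_espeak_ng_english_us_vits/"
    "train.total_count.ave_5best.pth"
)

# Averaged ESPnet2 checkpoints are assumed here to be plain dicts of
# parameter name -> tensor, so they can be loaded directly with torch.
state_dict = torch.load(ckpt_path, map_location="cpu")

for name, tensor in state_dict.items():
    if "global_emb" in name:
        # Expected shape: (n_speakers, channels). This is the parameter
        # excluded in --init_param, because its size changes when the
        # number of speakers changes.
        print(name, tuple(tensor.shape))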

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
lheuveline commented, Nov 18, 2021

After diving deep into ESPnet and looking carefully at the attention at inference time, comparing my model with the pretrained one, I found that the source of my problem was a phoneme duration issue (output_dict["duration"] has a size of 1).

I tried to freeze the stochastic duration predictor and also tried to continue training on VCTK, and found the same issue.

In fact, I forgot that the espeak model is trained directly on phonemes and not on raw text! When I correctly feed phonemes as input (or overwrite preprocess_fn after instantiation), the duration prediction recovers the correct sizes.
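Here is a rough sketch of that check, assuming placeholder paths, speaker id, and phoneme sequence (the model was trained with --g2p none, so the preprocessor only splits the input on spaces):

# Feed an already-phonemized, space-separated sequence instead of raw text.
# Paths, the speaker id, and the phoneme string are placeholders.
import numpy as np
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech(
    train_config="exp/44k/tts_finetune_vits_espeak/config.yaml",              # placeholder
    model_file="exp/44k/tts_finetune_vits_espeak/train.total_count.ave.pth",  # placeholder
    device="cpu",
)

phonemes = "h ə l oʊ w ɜː l d"            # placeholder espeak-style phonemes
output = tts(phonemes, sids=np.array(4))  # placeholder speaker id from training

# With phonemes as input, the predicted durations should cover the whole
# token sequence instead of having a size of 1.
print(output["duration"].shape)

sf.write("sample.wav", output["wav"].numpy(), tts.fs)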

Thanks a lot for your insights, I am now able to work on integrating new speakers!

0 reactions
lheuveline commented, Dec 2, 2021

@skol101 I tried to use the deepvoice3_pytorch preprocessing scripts, but in the end I directly used https://github.com/lowerquality/gentle, which is what deepvoice3_pytorch uses to generate the .lab files, if I remember correctly. It’s quite slow, but you can easily run Gentle in Docker (https://hub.docker.com/r/lowerquality/gentle/) and let the process run in the background.
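A rough sketch of that workflow, assuming placeholder file names; the endpoint and form-field names follow Gentle’s README, so double-check them against your version. Start the service with docker run -d -p 8765:8765 lowerquality/gentle, then post each audio/transcript pair and convert the returned word timings into .lab entries:

# Align one utterance against its transcript with a local Gentle container.
import requests

with open("p225_001.wav", "rb") as audio, open("p225_001.txt", "rb") as transcript:
    resp = requests.post(
        "http://localhost:8765/transcriptions?async=false",  # Gentle's HTTP API
        files={"audio": audio, "transcript": transcript},
    )
resp.raise_for_status()

# Gentle returns word-level alignments with start/end times in seconds,
# which can then be written out in the .lab format the recipe expects.
for word in resp.json()["words"]:
    if word.get("case") == "success":
        print(word["start"], word["end"], word["alignedWord"])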
