[Usage] VITS finetuning
Hi,
First of all, thanks for this incredible work, I hope to be able to contribute soon!
I’m trying to finetune VITS from the espeak VCTK 44100 checkpoint, but after 245 epochs the generated audio has serious “early stopping” issues: when I synthesize with the resulting checkpoint, the audio stops more or less after the first phoneme. I only use sentences that I tested with the pretrained model.
I extended the VCTK dataset with more speakers from my private dataset (n = 239, including the original VCTK speakers). But I can’t figure out whether the problem is in my dataset or in my ESPnet training config. I probably have some issues with my .lab files; could they be the source of my bad training?
Here is the command used in stage 6 for training:
```sh
./tts.sh \
    --lang en \
    --feats_type raw \
    --fs 48000 \
    --n_fft 2048 \
    --n_shift 300 \
    --win_length 1200 \
    --token_type phn \
    --cleaner tacotron \
    --g2p g2p_en_no_space \
    --train_config conf/train.yaml \
    --inference_config conf/decode.yaml \
    --train_set tr_no_dev \
    --valid_set dev \
    --test_sets 'dev eval1' \
    --srctexts data/tr_no_dev/text \
    --audio_format wav \
    --train_set tr_no_dev_phn \
    --valid_set dev_phn \
    --test_sets 'dev_phn eval1_phn' \
    --srctexts data/tr_no_dev_phn/text \
    --g2p none \
    --cleaner none \
    --stage 6 \
    --use_sid true \
    --min_wav_duration 0.38 \
    --ngpu 1 \
    --n_shift 512 \
    --dumpdir dump/44k \
    --expdir exp/44k \
    --tts_task gan_tts \
    --feats_extract linear_spectrogram \
    --feats_normalize none \
    --train_config ./conf/train.yaml \
    --inference_model train.total_count.ave.pth \
    --train_args '--init_param downloads/c958873c3aa8d54124819460626cf9d7/exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_espeak_ng_english_us_vits/train.total_count.ave_5best.pth:tts:tts:tts.generator.global_emb.weight' \
    --tag finetune_vits_espeak \
    --stage 6
```
Is this command correct? I’m really not sure about the --init_param usage, but I tried to follow the documentation (mostly the JVS recipe). I saw in another issue that I should get intelligible results after a few epochs with a finetuning setup. Is that correct?
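For reference, my understanding of the --init_param syntax from the docs is `<file_path>:<src_key>:<dst_key>:<exclude_keys>`, so the argument above should load every `tts` parameter from the pretrained checkpoint while excluding the speaker embedding table (`tts.generator.global_emb.weight`), which has to be reinitialized anyway since my speaker count changed. Please correct me if that reading is wrong.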
Top GitHub Comments
After diving deep into ESPnet and carefully comparing the attention at inference time between my model and the pretrained one, I found that the source of my problem was a phoneme duration issue (output_dict["duration"] had a size of 1).
I tried to freeze the stochastic duration predictor and also tried to continue training on VCTK, and ran into the same issue.
In fact, I forgot that the espeak model is trained directly on phonemes and not on raw text! When I correctly feed phonemes as input (or overwrite preprocess_fn after instantiation), the duration prediction recovers correct sizes.
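For anyone hitting the same thing, here is a minimal sketch of the check I ended up doing; the experiment paths, speaker id, and phoneme string below are placeholders for my own setup:

```python
# Minimal sketch (paths, speaker id, and phoneme string are placeholders).
import numpy as np
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech(
    train_config="exp/44k/tts_finetune_vits_espeak/config.yaml",
    model_file="exp/44k/tts_finetune_vits_espeak/train.total_count.ave.pth",
)

# The model was trained with --g2p none / --cleaner none, so the input must
# already be espeak phonemes separated by spaces, not raw English text.
phonemes = "h ə l oʊ w ɜː l d"
out = tts(phonemes, sids=np.array([3]))  # sids selects the speaker

# With raw text I got a duration of size 1; with phonemes it matches the
# number of input tokens.
print(out["duration"].shape)

sf.write("out.wav", out["wav"].cpu().numpy(), tts.fs)
```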
Thanks a lot for your insights, I am now able to work on integrating new speakers!
@skol101 I tried to use the deepvoice3_pytorch preprocessing scripts, but I finally used https://github.com/lowerquality/gentle directly, which is what deepvoice3_pytorch uses to generate lab files, if I remember correctly. It’s quite slow, but you can easily run Gentle in Docker (https://hub.docker.com/r/lowerquality/gentle/) and run the process in the background.
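If it helps, this is roughly how I query it once the container is up (`docker run -p 8765:8765 lowerquality/gentle`); the file names are placeholders:

```python
# Rough sketch of calling Gentle's HTTP API (file names are placeholders);
# the container must already be running on localhost:8765.
import requests

with open("sample.wav", "rb") as audio, open("sample.txt") as transcript:
    r = requests.post(
        "http://localhost:8765/transcriptions?async=false",
        files={"audio": audio},
        data={"transcript": transcript.read()},
    )

# The JSON response contains word-level start/end times that can be
# converted into .lab files.
print(r.json()["words"][:3])
```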