[Usage] VITS finetuning
Hi,
First of all, thanks for this incredible work, I hope to be able to contribute soon!
I’m trying to finetune VITS from the espeak VCTK 44100 checkpoint, but after 245 epochs the generated audio has serious “early stopping” issues: when I synthesize with the resulting checkpoint, the audio stops more or less after the first phoneme. I only use sentences that I tested with the pretrained model.
I extended the VCTK dataset with more speakers from my private dataset (n = 239, including the original VCTK speakers). But I can’t figure out whether the problem is in my dataset or in my ESPnet training config. I probably have some issues with my .lab files; could they be the source of my bad training?
Here is the command used in stage 6 for training:
```sh
./tts.sh \
    --lang en \
    --feats_type raw \
    --fs 48000 \
    --n_fft 2048 \
    --n_shift 300 \
    --win_length 1200 \
    --token_type phn \
    --cleaner tacotron \
    --g2p g2p_en_no_space \
    --train_config conf/train.yaml \
    --inference_config conf/decode.yaml \
    --train_set tr_no_dev \
    --valid_set dev \
    --test_sets 'dev eval1' \
    --srctexts data/tr_no_dev/text \
    --audio_format wav \
    --train_set tr_no_dev_phn \
    --valid_set dev_phn \
    --test_sets 'dev_phn eval1_phn' \
    --srctexts data/tr_no_dev_phn/text \
    --g2p none \
    --cleaner none \
    --stage 6 \
    --use_sid true \
    --min_wav_duration 0.38 \
    --ngpu 1 \
    --n_shift 512 \
    --dumpdir dump/44k \
    --expdir exp/44k \
    --tts_task gan_tts \
    --feats_extract linear_spectrogram \
    --feats_normalize none \
    --train_config ./conf/train.yaml \
    --inference_model train.total_count.ave.pth \
    --train_args '--init_param downloads/c958873c3aa8d54124819460626cf9d7/exp/tts_train_full_band_multi_spk_vits_raw_phn_tacotron_espeak_ng_english_us_vits/train.total_count.ave_5best.pth:tts:tts:tts.generator.global_emb.weight' \
    --tag finetune_vits_espeak \
    --stage 6
```
Is this command correct? I’m really not sure about the --init_param usage, but I tried to follow the documentation (mostly the JVS recipe). I saw in another issue that I should get intelligible results after a few epochs with a finetuning setup. Is that correct?
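For reference, my understanding of the --init_param syntax from the docs is `<file_path>:<src_key>:<dst_key>:<exclude_keys>`, so the argument above should load every `tts` parameter from the pretrained checkpoint while excluding the speaker embedding table (`tts.generator.global_emb.weight`), which has to be reinitialized anyway since my speaker count changed. Please correct me if that reading is wrong.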
Top GitHub Comments
After diving deep into ESPnet and carefully comparing the attention at inference time between my model and the pretrained one, I found that the source of my problem was a phoneme duration issue (output_dict["duration"] had a size of 1).
I tried to freeze the stochastic duration predictor and also tried to continue training on VCTK, and ran into the same issue.
In fact, I forgot that the espeak model is trained directly on phonemes and not on raw text! When I correctly feed phonemes as input (or overwrite preprocess_fn after instantiation), the duration prediction recovers correct sizes.
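For anyone hitting the same thing, here is a minimal sketch of the check I ended up doing; the experiment paths, speaker id, and phoneme string below are placeholders for my own setup:

```python
# Minimal sketch (paths, speaker id, and phoneme string are placeholders).
import numpy as np
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech(
    train_config="exp/44k/tts_finetune_vits_espeak/config.yaml",
    model_file="exp/44k/tts_finetune_vits_espeak/train.total_count.ave.pth",
)

# The model was trained with --g2p none / --cleaner none, so the input must
# already be espeak phonemes separated by spaces, not raw English text.
phonemes = "h ə l oʊ w ɜː l d"
out = tts(phonemes, sids=np.array([3]))  # sids selects the speaker

# With raw text I got a duration of size 1; with phonemes it matches the
# number of input tokens.
print(out["duration"].shape)

sf.write("out.wav", out["wav"].cpu().numpy(), tts.fs)
```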
Thanks a lot for your insights, I am now able to work on integrating new speakers!
@skol101 I tried to use the deepvoice3_pytorch preprocessing scripts, but I finally used https://github.com/lowerquality/gentle directly, which is what deepvoice3_pytorch uses to generate lab files, if I remember correctly. It’s quite slow, but you can easily run Gentle in Docker (https://hub.docker.com/r/lowerquality/gentle/) and run the process in the background.
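If it helps, this is roughly how I query it once the container is up (`docker run -p 8765:8765 lowerquality/gentle`); the file names are placeholders:

```python
# Rough sketch of calling Gentle's HTTP API (file names are placeholders);
# the container must already be running on localhost:8765.
import requests

with open("sample.wav", "rb") as audio, open("sample.txt") as transcript:
    r = requests.post(
        "http://localhost:8765/transcriptions?async=false",
        files={"audio": audio},
        data={"transcript": transcript.read()},
    )

# The JSON response contains word-level start/end times that can be
# converted into .lab files.
print(r.json()["words"][:3])
```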