GravesAttention with Tacotron 1 yields empty alignment plots during training and throws an AttributeError during inference
I've trained a Tacotron 1 model with GST and GravesAttention. Throughout training (80k+ steps), all training and eval alignment plots have been empty. The model still produced audio in TensorBoard, but when I used the logic from one of the notebooks to evaluate the model and synthesize speech, it raised the following error: AttributeError: 'GravesAttention' object has no attribute 'init_win_idx'
referring to this line in layers/tacotron.py:
--> 478 self.attention.init_win_idx()
I suspect the Tacotron 1 model is not fully set up to use GravesAttention, because some of the methods it calls in layers/tacotron.py do not exist in the GravesAttention class.
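As a minimal sketch of the kind of guard that would avoid the crash (illustrative only, assuming the decoder keeps its attention layer in self.attention; this is not the actual fix that later landed):

# Only location-sensitive attention carries windowing state; GravesAttention
# does not, so skip init_win_idx() when the attribute is missing.
if hasattr(self.attention, "init_win_idx"):
    self.attention.init_win_idx()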
Config:
{
    "model": "Tacotron",
    "run_name": "blizzard-gts",
    "run_description": "tacotron GST.",
    "audio": {
        "fft_size": 1024,
        "win_length": 1024,
        "hop_length": 256,
        "frame_length_ms": null,
        "frame_shift_ms": null,
        "sample_rate": 24000,
        "preemphasis": 0.0,
        "ref_level_db": 20,
        "do_trim_silence": true,
        "trim_db": 60,
        "power": 1.5,
        "griffin_lim_iters": 60,
        "num_mels": 80,
        "mel_fmin": 95.0,
        "mel_fmax": 12000.0,
        "spec_gain": 20,
        "signal_norm": true,
        "min_level_db": -100,
        "symmetric_norm": true,
        "max_norm": 4.0,
        "clip_norm": true,
        "stats_path": null
    },
    "distributed": {
        "backend": "nccl",
        "url": "tcp://localhost:54321"
    },
    "reinit_layers": [],
    "batch_size": 128,
    "eval_batch_size": 16,
    "r": 7,
    "gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]],
    "mixed_precision": true,
    "loss_masking": false,
    "decoder_loss_alpha": 0.5,
    "postnet_loss_alpha": 0.25,
    "postnet_diff_spec_alpha": 0.25,
    "decoder_diff_spec_alpha": 0.25,
    "decoder_ssim_alpha": 0.5,
    "postnet_ssim_alpha": 0.25,
    "ga_alpha": 5.0,
    "stopnet_pos_weight": 15.0,
    "run_eval": true,
    "test_delay_epochs": 10,
    "test_sentences_file": null,
    "noam_schedule": false,
    "grad_clip": 1.0,
    "epochs": 300000,
    "lr": 0.0001,
    "wd": 0.000001,
    "warmup_steps": 4000,
    "seq_len_norm": false,
    "memory_size": -1,
    "prenet_type": "original",
    "prenet_dropout": true,
    "attention_type": "graves",
    "attention_heads": 4,
    "attention_norm": "sigmoid",
    "windowing": false,
    "use_forward_attn": false,
    "forward_attn_mask": false,
    "transition_agent": false,
    "location_attn": true,
    "bidirectional_decoder": false,
    "double_decoder_consistency": false,
    "ddc_r": 7,
    "stopnet": true,
    "separate_stopnet": true,
    "print_step": 25,
    "tb_plot_step": 100,
    "print_eval": false,
    "save_step": 5000,
    "checkpoint": true,
    "tb_model_param_stats": false,
    "text_cleaner": "phoneme_cleaners",
    "enable_eos_bos_chars": false,
    "num_loader_workers": 8,
    "num_val_loader_workers": 8,
    "batch_group_size": 4,
    "min_seq_len": 6,
    "max_seq_len": 153,
    "compute_input_seq_cache": false,
    "use_noise_augment": true,
    "output_path": "/home/big-boy/Models/Blizzard/",
    "phoneme_cache_path": "/home/big-boy/Models/phoneme_cache/",
    "use_phonemes": true,
    "phoneme_language": "en-us",
    "use_speaker_embedding": false,
    "use_gst": true,
    "use_external_speaker_embedding_file": false,
    "external_speaker_embedding_file": "…/…/speakers-vctk-en.json",
    "gst": {
        "gst_style_input": null,
        "gst_embedding_dim": 512,
        "gst_num_heads": 4,
        "gst_style_tokens": 10,
        "gst_use_speaker_embedding": false
    },
    "datasets": [
        {
            "name": "ljspeech",
            "path": "/Data/blizzard2013/segmented/",
            "meta_file_train": "metadata.csv",
            "meta_file_val": null
        }
    ]
}
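For reference, these are the attention-related keys from the config above, rewritten as a small Python dict; the comments are my reading of how each key relates to GravesAttention and may not match the library exactly:

# Hypothetical annotation of the attention settings used in this run.
attention_settings = {
    "attention_type": "graves",   # selects the GMM-based GravesAttention layer
    "attention_heads": 4,         # number of Gaussian mixture components (used only by graves)
    "attention_norm": "sigmoid",  # normalization for location-sensitive attention; presumably ignored by graves
    "windowing": False,           # inference-time attention windowing (the init_win_idx() path)
    "location_attn": True,        # only relevant when attention_type is "original"
}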
Alignment plots:
Top GitHub Comments
Good to hear that, but I am personally not sure whether the implementation is right compared to this paper: https://arxiv.org/abs/1910.10288
AFAIK this is the most robust Graves attention proposed for TTS so far, but it may be wrong.
It'd be nice if you could double-check.
Closing this because the AttributeError bug was fixed in https://github.com/coqui-ai/TTS/pull/479, and GMM (Graves) attention will be looked at in a separate discussion.
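For anyone who wants to do that comparison, below is a minimal, self-contained sketch of a GMM ("Graves") attention step in PyTorch, roughly following the normalized GMMv2b variant described in arXiv:1910.10288; every name, shape, and default here is an illustrative assumption and not the repository's actual GravesAttention code.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMAttentionSketch(nn.Module):
    # Illustrative GMM attention roughly following the GMMv2b variant of
    # arXiv:1910.10288; not the TTS repository's implementation.
    def __init__(self, query_dim, num_mixtures=5):
        super().__init__()
        self.num_mixtures = num_mixtures
        # MLP mapping the decoder query to (weight, step, scale) per mixture
        self.mlp = nn.Sequential(
            nn.Linear(query_dim, query_dim),
            nn.Tanh(),
            nn.Linear(query_dim, 3 * num_mixtures),
        )

    def forward(self, query, memory_len, mu_prev):
        # query: [B, query_dim]; mu_prev: [B, K] running mixture means
        w_hat, delta_hat, sigma_hat = self.mlp(query).chunk(3, dim=-1)
        w = torch.softmax(w_hat, dim=-1)        # mixture weights sum to 1
        delta = F.softplus(delta_hat)           # positive step size
        sigma = F.softplus(sigma_hat) + 1e-5    # positive standard deviation
        mu = mu_prev + delta                    # means move monotonically forward
        pos = torch.arange(memory_len, device=query.device, dtype=torch.float32)
        pos = pos.view(1, 1, -1)                                    # [1, 1, T]
        mu_, sigma_, w_ = mu.unsqueeze(-1), sigma.unsqueeze(-1), w.unsqueeze(-1)
        # normalized Gaussian density per mixture, summed over mixtures
        dens = torch.exp(-0.5 * ((pos - mu_) / sigma_) ** 2) / (sigma_ * math.sqrt(2 * math.pi))
        alignment = (w_ * dens).sum(dim=1)                          # [B, T]
        return alignment, mu

With this family of attention, a washed-out or empty-looking alignment plot can simply mean the mixture means barely advance or the scales stay very large, so checking the softplus/normalization choices against the paper might be a reasonable place to start.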