GravesAttention with Tacotron 1 yields empty alignment plots during training and throws an AttributeError during inference
I've trained a Tacotron 1 model with GST and GravesAttention. Throughout training (80k+ steps), all training and eval alignment plots have been empty. The model still produced audio in TensorBoard, but when I used the logic from one of the notebooks to evaluate the model and synthesize speech, it raised the following error: AttributeError: 'GravesAttention' object has no attribute 'init_win_idx'
referring to this line in layers/tacotron.py:
--> 478 self.attention.init_win_idx()
I suspect the Tacotron 1 model is not fully set up to use GravesAttention, because some of the methods it calls in layers/tacotron.py do not exist in the GravesAttention class.
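As a minimal sketch of the kind of guard that would avoid the crash (illustrative only, assuming the decoder keeps its attention layer in self.attention; this is not the actual fix that later landed):

# Only location-sensitive attention carries windowing state; GravesAttention
# does not, so skip init_win_idx() when the attribute is missing.
if hasattr(self.attention, "init_win_idx"):
    self.attention.init_win_idx()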
Config:
{
    "model": "Tacotron",
    "run_name": "blizzard-gts",
    "run_description": "tacotron GST.",
    "audio": {
        "fft_size": 1024,
        "win_length": 1024,
        "hop_length": 256,
        "frame_length_ms": null,
        "frame_shift_ms": null,
        "sample_rate": 24000,
        "preemphasis": 0.0,
        "ref_level_db": 20,
        "do_trim_silence": true,
        "trim_db": 60,
        "power": 1.5,
        "griffin_lim_iters": 60,
        "num_mels": 80,
        "mel_fmin": 95.0,
        "mel_fmax": 12000.0,
        "spec_gain": 20,
        "signal_norm": true,
        "min_level_db": -100,
        "symmetric_norm": true,
        "max_norm": 4.0,
        "clip_norm": true,
        "stats_path": null
    },
    "distributed": {
        "backend": "nccl",
        "url": "tcp://localhost:54321"
    },
    "reinit_layers": [],
    "batch_size": 128,
    "eval_batch_size": 16,
    "r": 7,
    "gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]],
    "mixed_precision": true,
    "loss_masking": false,
    "decoder_loss_alpha": 0.5,
    "postnet_loss_alpha": 0.25,
    "postnet_diff_spec_alpha": 0.25,
    "decoder_diff_spec_alpha": 0.25,
    "decoder_ssim_alpha": 0.5,
    "postnet_ssim_alpha": 0.25,
    "ga_alpha": 5.0,
    "stopnet_pos_weight": 15.0,
    "run_eval": true,
    "test_delay_epochs": 10,
    "test_sentences_file": null,
    "noam_schedule": false,
    "grad_clip": 1.0,
    "epochs": 300000,
    "lr": 0.0001,
    "wd": 0.000001,
    "warmup_steps": 4000,
    "seq_len_norm": false,
    "memory_size": -1,
    "prenet_type": "original",
    "prenet_dropout": true,
    "attention_type": "graves",
    "attention_heads": 4,
    "attention_norm": "sigmoid",
    "windowing": false,
    "use_forward_attn": false,
    "forward_attn_mask": false,
    "transition_agent": false,
    "location_attn": true,
    "bidirectional_decoder": false,
    "double_decoder_consistency": false,
    "ddc_r": 7,
    "stopnet": true,
    "separate_stopnet": true,
    "print_step": 25,
    "tb_plot_step": 100,
    "print_eval": false,
    "save_step": 5000,
    "checkpoint": true,
    "tb_model_param_stats": false,
    "text_cleaner": "phoneme_cleaners",
    "enable_eos_bos_chars": false,
    "num_loader_workers": 8,
    "num_val_loader_workers": 8,
    "batch_group_size": 4,
    "min_seq_len": 6,
    "max_seq_len": 153,
    "compute_input_seq_cache": false,
    "use_noise_augment": true,
    "output_path": "/home/big-boy/Models/Blizzard/",
    "phoneme_cache_path": "/home/big-boy/Models/phoneme_cache/",
    "use_phonemes": true,
    "phoneme_language": "en-us",
    "use_speaker_embedding": false,
    "use_gst": true,
    "use_external_speaker_embedding_file": false,
    "external_speaker_embedding_file": "…/…/speakers-vctk-en.json",
    "gst": {
        "gst_style_input": null,
        "gst_embedding_dim": 512,
        "gst_num_heads": 4,
        "gst_style_tokens": 10,
        "gst_use_speaker_embedding": false
    },
    "datasets": [
        {
            "name": "ljspeech",
            "path": "/Data/blizzard2013/segmented/",
            "meta_file_train": "metadata.csv",
            "meta_file_val": null
        }
    ]
}
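For reference, these are the attention-related keys from the config above, rewritten as a small Python dict; the comments are my reading of how each key relates to GravesAttention and may not match the library exactly:

# Hypothetical annotation of the attention settings used in this run.
attention_settings = {
    "attention_type": "graves",   # selects the GMM-based GravesAttention layer
    "attention_heads": 4,         # number of Gaussian mixture components (used only by graves)
    "attention_norm": "sigmoid",  # normalization for location-sensitive attention; presumably ignored by graves
    "windowing": False,           # inference-time attention windowing (the init_win_idx() path)
    "location_attn": True,        # only relevant when attention_type is "original"
}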
Alignment plots:
Top GitHub Comments
Good to hear that, but I am personally not sure whether the implementation is right compared to this paper: https://arxiv.org/abs/1910.10288
AFAIK this is the most robust Graves attention proposed for TTS so far, but it may be wrong.
It'd be nice if you could double-check.
Closing this because the AttributeError bug was fixed in https://github.com/coqui-ai/TTS/pull/479, and GMM (Graves) attention will be looked at in a separate discussion.
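For anyone who wants to do that comparison, below is a minimal, self-contained sketch of a GMM ("Graves") attention step in PyTorch, roughly following the normalized GMMv2b variant described in arXiv:1910.10288; every name, shape, and default here is an illustrative assumption and not the repository's actual GravesAttention code.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMAttentionSketch(nn.Module):
    # Illustrative GMM attention roughly following the GMMv2b variant of
    # arXiv:1910.10288; not the TTS repository's implementation.
    def __init__(self, query_dim, num_mixtures=5):
        super().__init__()
        self.num_mixtures = num_mixtures
        # MLP mapping the decoder query to (weight, step, scale) per mixture
        self.mlp = nn.Sequential(
            nn.Linear(query_dim, query_dim),
            nn.Tanh(),
            nn.Linear(query_dim, 3 * num_mixtures),
        )

    def forward(self, query, memory_len, mu_prev):
        # query: [B, query_dim]; mu_prev: [B, K] running mixture means
        w_hat, delta_hat, sigma_hat = self.mlp(query).chunk(3, dim=-1)
        w = torch.softmax(w_hat, dim=-1)        # mixture weights sum to 1
        delta = F.softplus(delta_hat)           # positive step size
        sigma = F.softplus(sigma_hat) + 1e-5    # positive standard deviation
        mu = mu_prev + delta                    # means move monotonically forward
        pos = torch.arange(memory_len, device=query.device, dtype=torch.float32)
        pos = pos.view(1, 1, -1)                                    # [1, 1, T]
        mu_, sigma_, w_ = mu.unsqueeze(-1), sigma.unsqueeze(-1), w.unsqueeze(-1)
        # normalized Gaussian density per mixture, summed over mixtures
        dens = torch.exp(-0.5 * ((pos - mu_) / sigma_) ** 2) / (sigma_ * math.sqrt(2 * math.pi))
        alignment = (w_ * dens).sum(dim=1)                          # [B, T]
        return alignment, mu

With this family of attention, a washed-out or empty-looking alignment plot can simply mean the mixture means barely advance or the scales stay very large, so checking the softplus/normalization choices against the paper might be a reasonable place to start.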