TTS Tacotron 2, "Weak" Alignment, any suggestions?
I am trying to train Tacotron 2 on a custom single-speaker dataset of roughly 11 hours.
I have trained other implementations of Tacotron 1 before, and one of those implementations learned very good alignment on this data.
ESPnet also learns an alignment, but it is a bit "weak", meaning that at many timesteps the attended phonemes are slightly off.
I am training this on 6 GPUs with batch size 32. I also trained LibriTTS from the default recipe, and consistent with the config, that model learned very good alignment in only 30 epochs. As the GIF below shows, the alignment on my data does get better, but it takes 800 epochs, and it is still somewhat weak.
Can anyone suggest what the problem could be, or what I could do to improve things? Any help would be greatly appreciated.
EDIT:
To avoid changing too many specifications at once, I have only changed the sampling rate in the config to match my data's SR; all other parameters (n_mels, n_fft, etc.) remain the same. Could this be an issue? Should I resample my data to 24000 Hz to match the LibriTTS specification and try training again? My params are as follows:
```
fs=16000       # sampling frequency
fmax=""        # maximum frequency
fmin=""        # minimum frequency
n_mels=80      # number of mel basis
n_fft=1024     # number of fft points
n_shift=256    # number of shift points
win_length=""  # window length
```
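One thing worth checking (a minimal arithmetic sketch, not ESPnet code): keeping `n_fft` and `n_shift` fixed while lowering `fs` from the LibriTTS 24000 Hz to 16000 Hz stretches the analysis window and hop in time, so each feature frame covers more audio than in the default recipe.

```python
def frame_params_ms(fs, n_fft, n_shift):
    """Return (window_ms, hop_ms) for an STFT at sampling rate fs."""
    return 1000 * n_fft / fs, 1000 * n_shift / fs

# LibriTTS recipe defaults: fs=24000, n_fft=1024, n_shift=256
# -> roughly 42.7 ms window, 10.7 ms hop
libritts = frame_params_ms(24000, 1024, 256)

# Same n_fft/n_shift at fs=16000 -> 64 ms window, 16 ms hop,
# i.e. a noticeably coarser frame rate for the attention to align to
custom = frame_params_ms(16000, 1024, 256)
print(libritts, custom)
```

If matching the recipe's effective frame rate matters, either resample the audio to 24000 Hz or scale `n_fft`/`n_shift` proportionally with `fs`.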
Additionally, this is what my mel spectrograms look like from feats.scp and feats.ark:
random feats.scp
random feats.ark
This is very different from what the LibriTTS features looked like, but my limited knowledge of signal processing does not help me identify what the problem could be.
Issue Analytics
- State:
- Created: 4 years ago
- Reactions: 1
- Comments: 35 (20 by maintainers)
Top GitHub Comments
I think the main difference was the silence trimming, since the aforementioned repo also applies silence trimming.
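For reference, this kind of leading/trailing silence trimming can be sketched with a simple energy gate. This is a minimal stand-in for the toolkit's own trimming stage (e.g. `librosa.effects.trim`); the threshold and framing values here are illustrative assumptions, not the recipe's actual settings.

```python
import numpy as np

def trim_silence(wav, threshold_db=-40.0, frame=1024, hop=256):
    """Trim leading/trailing silence using frame-wise RMS relative to the peak.

    Simplified stand-in for librosa.effects.trim; threshold_db, frame,
    and hop are illustrative, not recipe values.
    """
    peak = np.max(np.abs(wav)) + 1e-10
    n_frames = max(1, 1 + (len(wav) - frame) // hop)
    rms_db = []
    for i in range(n_frames):
        seg = wav[i * hop : i * hop + frame]
        rms = np.sqrt(np.mean(seg ** 2)) + 1e-10
        rms_db.append(20 * np.log10(rms / peak))
    voiced = [i for i, db in enumerate(rms_db) if db > threshold_db]
    if not voiced:
        return wav[:0]  # all silence
    start = voiced[0] * hop
    end = min(len(wav), voiced[-1] * hop + frame)
    return wav[start:end]

# Example: 0.5 s silence + 1 s of a 220 Hz tone + 0.5 s silence at 16 kHz
fs = 16000
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(fs) / fs)
wav = np.concatenate([np.zeros(fs // 2), tone, np.zeros(fs // 2)])
trimmed = trim_silence(wav)
```

Untrimmed silences give the attention mechanism long stretches with no phonetic content to align to, which is one plausible reason alignments come out diffuse.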
Thank you both for all your help.
Also, can we please keep the issue open for now? I will close it once I am able to post the final results.
UPDATE:
This is the alignment after 125 epochs. It is too early to make a definitive statement, but the progress looks healthy; I will be sure to keep you all updated.