
Tacotron2 produces random mel outputs during inference (French dataset)

See original GitHub issue

Hi! I have trained Tacotron2 for 52k steps on the SynPaFlex French dataset. I deleted sentences longer than 20 seconds from the dataset and ended up with around 30 hours of single-speaker data.

I made a custom synpaflex.py processor in ./tensorflow_tts/processor/ with these symbols (adapted to French, without ARPAbet):

_pad = "pad"
_eos = "eos"
_punctuation = "!/\'(),-.:;? "
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzéèàùâêîôûçäëïöüÿœæ"

# Export all symbols:
SYNPAFLEX_SYMBOLS = (
    [_pad] + list(_punctuation) + list(_letters) + [_eos]
)

I used basic_cleaners for text cleaning.
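One frequent cause of unintelligible output with a custom processor is input characters that fall outside the symbol list and so map to no valid ID. Two things are worth checking with the symbol set above: the accented letters in `_letters` are lowercase only (so text must be lowercased first, which keithito-style `basic_cleaners` normally does), and accented characters stored in decomposed Unicode form ("e" + combining accent) will not match the precomposed "é" in the list. A minimal coverage check, assuming the symbol set from the issue (the `uncovered_chars` helper is illustrative, not a TensorFlowTTS API):

```python
import unicodedata

# Symbol set copied from the issue's synpaflex.py
_pad = "pad"
_eos = "eos"
_punctuation = "!/'(),-.:;? "
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzéèàùâêîôûçäëïöüÿœæ"

def uncovered_chars(text):
    """Return the characters in `text` that map to no symbol in the set above."""
    # NFC normalization folds 'e' + combining acute back into a single 'é',
    # so decomposed input does not silently miss the accented letters.
    text = unicodedata.normalize("NFC", text)
    known = set(_punctuation) | set(_letters)
    return sorted({ch for ch in text if ch not in known})

raw = "Être à l'heure, c'est déjà être en retard."
print(uncovered_chars(raw))          # → ['Ê'] (uppercase accented letters are not in _letters)
print(uncovered_chars(raw.lower()))  # → [] (covered once the text is lowercased)
```

Running this over every transcript in the dataset (after the same cleaning used in training) quickly shows whether any text is silently falling outside the symbol table.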

In #182 the symptoms were similar, but there the problem came from using tacotron2.v1.yaml as the configuration file. I am using my own tacotron2.synpaflex.v1.yaml for both training and inference.

During synthesis, the mel outputs are completely random: the output differs even when the input sentence is kept exactly the same. The audio sounds like a French version of the WaveNet samples generated without any text conditioning, from the “Knowing What to Say” section of DeepMind's WaveNet blog post.
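For reference, the original Tacotron 2 paper deliberately keeps prenet dropout enabled at inference to introduce output variation, so small run-to-run differences can be expected in faithful implementations; completely different, unintelligible output on the same sentence points elsewhere (text-to-ID mapping or a train/inference config mismatch). One way to quantify the symptom is to synthesize the same sentence twice and compare the mel tensors directly. A minimal sketch of the comparison, assuming you already have the two mel arrays (any `synthesize` call producing them is a stand-in, not a TensorFlowTTS API):

```python
import numpy as np

def mels_match(mel_a, mel_b, tol=1e-4):
    """Compare mel spectrograms from two runs on the same input text."""
    mel_a, mel_b = np.asarray(mel_a), np.asarray(mel_b)
    if mel_a.shape != mel_b.shape:
        # Different decoder stop points already indicate non-determinism.
        return False
    return bool(np.max(np.abs(mel_a - mel_b)) < tol)

# Usage sketch (pseudo): mel1 = synthesize("bonjour"); mel2 = synthesize("bonjour")
# mels_match(mel1, mel2) → False would confirm non-deterministic inference.
```

If the two mels differ only slightly, prenet dropout is a plausible source; if they differ wildly or stop at different lengths, the model is effectively ignoring the text input.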

Here are my TensorBoard results: [screenshot of training curves omitted]

I must be doing something wrong somewhere, as I have been able to train on LJSpeech successfully… Any ideas?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 41 (19 by maintainers)

Top GitHub Comments

2 reactions
ttslr commented, Jul 8, 2021

> @ttslr Hi, seems you are an expert in this field 😄. I saw you have a lot of papers about TTS 😄

Thank you! I’m just an ordinary TTS researcher. 😃 😄

1 reaction
samuel-lunii commented, Jun 18, 2021

@ihshareef Not yet, still investigating! I will let everyone know when I find the solution.

