Tacotron2 inference fails, but training evaluation seems fine
I'm training a Tacotron2 model to perform duration extraction so that I can train FastSpeech2. I have a 5-hour custom non-English dataset that I prepared like LJSpeech (using the default preprocessing), and I'm training from scratch.
The alignments and the spectrograms generated during training look fine, and so do the TensorBoard outputs, shown below:
The spectrograms shown are from step 1,500, but they have not deteriorated by step 16,000.
However, I get terrible results when I download the model checkpoint (16,000 iterations) and perform inference. I created a Google Colab notebook for inference: https://colab.research.google.com/drive/1JLkncs27HaMo7dj05-T4cNW4u2nYAVWv?usp=sharing
The alignments and the spectrogram it produces are shown below:
In the AutoProcessor, I used the `ljspeech_mapper.json` that I got from the `dump_ljspeech` folder. I also used the `tacotron2.v1.yml` file that I found in the `examples` folder.
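Roughly, the inference setup in the notebook looks like this (a minimal sketch of the TensorFlowTTS loading code; the checkpoint filename and the exact paths are placeholders for my own files):

```python
import tensorflow as tf
from tensorflow_tts.inference import AutoConfig, AutoProcessor, TFAutoModel

# Character mapper produced during preprocessing, and the example config.
processor = AutoProcessor.from_pretrained(pretrained_path="dump_ljspeech/ljspeech_mapper.json")
config = AutoConfig.from_pretrained("examples/tacotron2/conf/tacotron2.v1.yml")

# Load the downloaded checkpoint (placeholder filename) into a Tacotron2 model.
tacotron2 = TFAutoModel.from_pretrained(
    config=config,
    pretrained_path="model-16000.h5",
    name="tacotron2",
)

# Free-running (non-teacher-forced) inference on a test sentence.
input_ids = processor.text_to_sequence("Some test sentence in my language.")
decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    input_lengths=tf.convert_to_tensor([len(input_ids)], tf.int32),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
)
```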
From what I understand from here, should I ignore this, keep training until 50k steps, and then use the model to extract durations for FastSpeech2?
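For context, duration extraction runs Tacotron2 in teacher-forced mode and, for each input token, counts how many decoder frames attend to it most strongly. A minimal sketch of that reduction (the repo's extraction script does this with more safeguards):

```python
import numpy as np

def durations_from_alignment(alignment: np.ndarray) -> np.ndarray:
    """alignment: [decoder_steps, encoder_steps] attention matrix from a
    teacher-forced Tacotron2 pass. Returns one integer duration per input
    token; the durations sum to the number of mel frames."""
    winners = np.argmax(alignment, axis=1)  # best input token for each frame
    return np.bincount(winners, minlength=alignment.shape[1])
```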
Thanks for the report 😄, that's good news, since it shows our pretrained models are valuable 🗡️
Just as an update: I suspect the dataset I used (~5,000 speech samples, 8 hours in total) was insufficient for Tacotron2 to learn properly from scratch. I used the pretrained LJSpeech model and fine-tuned it on my dataset, and now I'm able to generate speech from Tacotron2 without using teacher forcing.
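For anyone hitting the same wall, the warm start amounts to initializing from the LJSpeech checkpoint instead of random weights before continuing training on the custom dump. A rough sketch, assuming the checkpoint filename (which is a placeholder here); training itself still goes through the usual example training script:

```python
from tensorflow_tts.inference import AutoConfig, TFAutoModel

# Initialize Tacotron2 from the pretrained LJSpeech weights (placeholder
# filename) rather than from scratch, then continue training on the custom
# dataset. Note: if the custom language uses a different symbol set, the
# embedding layer must match the mapper used during preprocessing.
config = AutoConfig.from_pretrained("examples/tacotron2/conf/tacotron2.v1.yml")
tacotron2 = TFAutoModel.from_pretrained(
    config=config,
    pretrained_path="tacotron2-ljspeech.h5",
    name="tacotron2",
)
```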