I get random mel outputs when training Tacotron-2 from scratch with the LJSpeech dataset

I use the most recent code and the standard LJSpeech dataset. When I run inference, the model usually generates random mel outputs after the correct speech.

I use the inference code from the Hugging Face Hub:

import soundfile as sf
import numpy as np

import tensorflow as tf

from tensorflow_tts.inference import AutoProcessor
from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel

pretrained_mapper = "./tensorflow_tts/processor/pretrained/ljspeech_mapper.json"
pretrained_model_config = "./examples/tacotron2/conf/tacotron2.v1.yaml"
pretrained_model = "./examples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-70000.h5"
pretrained_generator_config = "./examples/multiband_melgan/conf/multiband_melgan.v1.yaml"
pretrained_generator = "./examples/multiband_melgan/pretrained/tts-mb_melgan-ljspeech-en.h5"
pretrained_model_sampling_rate = 22050

processor = AutoProcessor.from_pretrained(pretrained_mapper)
model_config = AutoConfig.from_pretrained(pretrained_model_config)
tacotron2 = TFAutoModel.from_pretrained(pretrained_model, model_config)
generator_config = AutoConfig.from_pretrained(pretrained_generator_config)
mb_melgan = TFAutoModel.from_pretrained(pretrained_generator, generator_config)

input_text = "This is a demo to show how to use our model to generate mel spectrogram from raw text."
print("input_text>>>>", input_text)

input_ids = processor.text_to_sequence(input_text)
print("phoneme seq: {}".format(input_ids))

# tacotron2 inference (text-to-mel)
decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    input_lengths=tf.convert_to_tensor([len(input_ids)], tf.int32),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
)

print(type(mel_outputs))
print(mel_outputs.shape)

# melgan inference (mel-to-wav)
audio = mb_melgan.inference(mel_outputs)[0, :, 0]

# save to file
sf.write('./audio.wav', audio, pretrained_model_sampling_rate, "PCM_16")

After inference, the code prints mel_outputs.shape like this:

(1, 470, 80)

If the middle number is in the range 400-600, the result is good, with no random data. I tested the models trained for 10k to 70k iterations, running inference three times per checkpoint, and collected the middle numbers in the table below. They show that the models are not reliable.

Is my inference code wrong, or did something go wrong during training?

| Iterations | Round 1 | Round 2 | Round 3 | Result |
| – | – | – | – | – |
| 10k | 2372 | 3966 | 4000 | Random |
| 20k | 502 | 578 | 489 | Pass |
| 30k | 4000 | 533 | 3518 | Random |
| 40k | 552 | 511 | 547 | Pass |
| 50k | 4000 | 4000 | 4000 | Random |
| 60k | 498 | 525 | 4000 | Random |
| 61k | 494 | 515 | 528 | Pass |
| 62k | 518 | 4000 | 484 | Random |
| 63k | 4000 | 4000 | 468 | Random |
| 64k | 497 | 4000 | 4000 | Random |
| 65k | 4000 | 4000 | 4000 | Random |
| 66k | 509 | 484 | 4000 | Random |
| 67k | 4000 | 491 | 3566 | Random |
| 68k | 484 | 481 | 476 | Pass |
| 69k | 4000 | 4000 | 505 | Random |
| 70k | 470 | 476 | 492 | Pass |
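The pass/random rule described above can be written down as a small check. This is only a sketch: the frame counts are copied from the table, the 400-600 range is the one quoted in the post for this particular sentence, and `HOP_SIZE = 256` is an assumption (the post does not state the hop size), used here just to turn frame counts into an approximate audio length. The helper names are hypothetical and not part of TensorFlowTTS.

```python
# Frame counts (middle value of mel_outputs.shape) copied from the table above.
RUNS = {
    "10k": (2372, 3966, 4000),
    "20k": (502, 578, 489),
    "30k": (4000, 533, 3518),
    "40k": (552, 511, 547),
    "50k": (4000, 4000, 4000),
    "60k": (498, 525, 4000),
    "61k": (494, 515, 528),
    "62k": (518, 4000, 484),
    "63k": (4000, 4000, 468),
    "64k": (497, 4000, 4000),
    "65k": (4000, 4000, 4000),
    "66k": (509, 484, 4000),
    "67k": (4000, 491, 3566),
    "68k": (484, 481, 476),
    "69k": (4000, 4000, 505),
    "70k": (470, 476, 492),
}

GOOD_RANGE = (400, 600)  # range the post reports as good for this sentence
SAMPLING_RATE = 22050    # taken from the inference script
HOP_SIZE = 256           # ASSUMPTION: hop size in samples, not stated in the post


def is_good(frames, good_range=GOOD_RANGE):
    """True if the frame count falls inside the range reported as good."""
    low, high = good_range
    return low <= frames <= high


def classify(rounds):
    """A checkpoint passes only if every inference round is in range."""
    return "Pass" if all(is_good(frames) for frames in rounds) else "Random"


def duration_seconds(frames, hop_size=HOP_SIZE, sampling_rate=SAMPLING_RATE):
    """Approximate audio length for a given number of mel frames."""
    return frames * hop_size / sampling_rate


results = {ckpt: classify(rounds) for ckpt, rounds in RUNS.items()}
passing = [ckpt for ckpt, result in results.items() if result == "Pass"]

for ckpt, rounds in RUNS.items():
    print(ckpt, rounds, results[ckpt])
print("passing checkpoints:", passing)
print("approx. seconds for 470 frames:", round(duration_seconds(470), 1))
print("approx. seconds for 4000 frames:", round(duration_seconds(4000), 1))
```

Under the assumed hop size, a run in the good range corresponds to a few seconds of audio, while a 4000-frame run is many times longer than the sentence needs. The recurring value 4000 may be a cap on decoder steps that is reached when the stop token never fires, but the post does not confirm this.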

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
ttsking commented, Aug 16, 2021

Update: I changed batch_size from 32 to 16 and tested the models at 10k, 15k, 20k, 25k, and 30k iterations. The model quality now looks more predictable. I think I can close this issue.

0 reactions
ttsking commented, Aug 11, 2021

@dathudeptrai I tested window 3 and window 1 and saw no improvement: https://drive.google.com/drive/folders/14ckSy9a4xBu29opPaQ-OsuWx4DqyvfKK?usp=sharing

Is it related to the training process? I use NVIDIA's Docker image 21.03-tf2-py3 with one 3090 GPU, and later updated to 21.06-tf2-py3. The TensorBoard curves for training look fine, but I get this random-output issue at inference.

I trained Tacotron2 with the Baker dataset (Chinese) a couple of weeks ago. Both training and inference worked fine.
