Help request: trying to figure out how to match up params between TTS and vocoder

I’m using a fork of https://github.com/Tomiinek/Multilingual_Text_to_Speech as the project https://github.com/CherokeeLanguage/Cherokee-TTS.

The TTS project I’m using shows the audio params below, but I don’t know what to change in either the TTS params or the vocoder params to make them match up. I’m guessing that hop_samples somehow corresponds to the stft_* settings, but I’m a bit clueless as to what I’m looking at. I’m thinking a good start would be to adjust the vocoder settings and train on the domain-specific voices being used in the Tacotron training.

TTS Tacotron Settings

    sample_rate = 22050                  # sample rate of source .wavs, used while computing spectrograms, MFCCs, etc.
    num_fft = 1102                       # number of frequency bins used during computation of spectrograms
    num_mels = 80                        # number of mel bins used during computation of mel spectrograms
    num_mfcc = 13                        # number of MFCCs, used just for MCD computation (during training)
    stft_window_ms = 50                  # size in ms of the Hann window of short-time Fourier transform, used during spectrogram computation
    stft_shift_ms = 12.5                 # shift of the window (or better said gap between windows) in ms

diffwave Vocoder Settings

    # Data params
    sample_rate=22050,
    n_mels=80,
    n_fft=1024,
    hop_samples=256,
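
To compare the two sets of settings, it helps to express the TTS window and shift in samples rather than milliseconds. The sketch below is only an illustration: `ms_to_samples` is a hypothetical helper, and it assumes the millisecond values are truncated to whole samples, which is consistent with `num_fft = 1102` in the TTS settings and the hop size of 275 mentioned in the reply further down.

```python
def ms_to_samples(ms, sample_rate):
    """Convert a duration in milliseconds to whole samples (truncating; an assumption)."""
    return int(sample_rate * ms / 1000)

sample_rate = 22050

# Values taken from the TTS Tacotron settings above.
tts_window = ms_to_samples(50, sample_rate)    # stft_window_ms
tts_hop = ms_to_samples(12.5, sample_rate)     # stft_shift_ms

# Values taken from the diffwave vocoder settings above.
vocoder = {"n_fft": 1024, "hop_samples": 256}

print("TTS window:", tts_window, "samples; vocoder n_fft:", vocoder["n_fft"])
print("TTS hop:", tts_hop, "samples; vocoder hop_samples:", vocoder["hop_samples"])
```

Under that assumption the TTS model produces one mel frame every 275 samples, while the vocoder is configured for one frame every 256 samples, so the two do not line up as configured. The thread only discusses changing the hop size; whether `n_fft` also needs to become 1102 is not confirmed there.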

Issue Analytics

Created 2 years ago
Comments: 10 (5 by maintainers)

Top GitHub Comments

sharvil commented, Nov 17, 2021

This happens because the spectrogram frames are upsampled by a factor of 256, and not 275 (your new hop size). Here’s how you can change the upsampling module to go up to a factor of 275:

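import torch.nn as nn
from torch.nn import ConvTranspose2d

class SpectrogramUpsampler(nn.Module):
  def __init__(self, n_mels):
    super().__init__()
    # The time axis is upsampled 11x, then 25x: 11 * 25 = 275 samples per spectrogram frame.
    self.conv1 = ConvTranspose2d(1, 1, [3, 22], stride=[1, 11], padding=[1, 6], output_padding=[0, 1])
    self.conv2 = ConvTranspose2d(1, 1, [3, 50], stride=[1, 25], padding=[1, 13], output_padding=[0, 1])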
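
As a sanity check on the layer parameters above, the output length of a transposed convolution can be computed by hand. The sketch below uses the output-size formula from the PyTorch `ConvTranspose2d` documentation; the helper names are hypothetical and no PyTorch install is needed to run it.

```python
def conv_transpose_out_len(length, kernel, stride, padding, output_padding=0, dilation=1):
    """Output length along one axis of a transposed convolution (PyTorch docs formula)."""
    return (length - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1

def upsampled_time_len(frames):
    # Time-axis settings of conv1, then conv2, copied from the snippet above.
    after_conv1 = conv_transpose_out_len(frames, kernel=22, stride=11, padding=6, output_padding=1)
    return conv_transpose_out_len(after_conv1, kernel=50, stride=25, padding=13, output_padding=1)

def mel_axis_len(n_mels):
    # Mel-axis settings are the same for both layers: kernel 3, stride 1, padding 1.
    after_conv1 = conv_transpose_out_len(n_mels, kernel=3, stride=1, padding=1)
    return conv_transpose_out_len(after_conv1, kernel=3, stride=1, padding=1)

for frames in (1, 10, 100):
    print(frames, "frames ->", upsampled_time_len(frames), "samples")
print("mel bins:", mel_axis_len(80))
```

Each spectrogram frame maps to exactly 275 output samples, and the number of mel bins is left unchanged, which is what a hop size of 275 requires.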

sharvil commented, Nov 24, 2021

Sounds like training is progressing as expected. The training loss for this generation of diffusion models has pretty high variance because of the noise schedule sampling procedure, so don’t let the fluctuation deter you. The model typically keeps improving even when the loss looks like it has flattened out.

Given that you’re training a multi-speaker model, I recommend training on all speakers for a large number of iterations/epochs, and then fine-tuning on individual speakers if the multi-speaker model isn’t good enough.
