Help request: trying to figure out how to match up params between TTS and vocoder.
See original GitHub issue
I'm using a fork of https://github.com/Tomiinek/Multilingual_Text_to_Speech as the project https://github.com/CherokeeLanguage/Cherokee-TTS.
The TTS project shows the audio params below, but I don't know what to change in either the TTS params or the vocoder params to make them match up. I'm guessing that hop_samples somehow corresponds to the stft_* settings, but I'm a bit clueless about what I'm looking at. I'm thinking a good start would be to adjust the vocoder settings and train them on the same domain-specific voices being used for the Tacotron training.
TTS Tacotron Settings
sample_rate = 22050 # sample rate of source .wavs, used while computing spectrograms, MFCCs, etc.
num_fft = 1102 # number of frequency bins used during computation of spectrograms
num_mels = 80 # number of mel bins used during computation of mel spectrograms
num_mfcc = 13 # number of MFCCs, used just for MCD computation (during training)
stft_window_ms = 50 # size in ms of the Hann window of short-time Fourier transform, used during spectrogram computation
stft_shift_ms = 12.5 # shift of the window (or better said gap between windows) in ms
diffwave Vocoder Settings
# Data params
sample_rate=22050,
n_mels=80,
n_fft=1024,
hop_samples=256,
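To see how the two param sets relate, it helps to convert the Tacotron settings (expressed in milliseconds) into sample counts, which is the unit DiffWave uses. This is a minimal sketch of that arithmetic; the variable names mirror the settings above:

```python
# Convert the Tacotron ms-based STFT settings into sample counts so they can
# be compared directly against the sample-based DiffWave settings.
sample_rate = 22050
stft_window_ms = 50
stft_shift_ms = 12.5

win_samples = round(sample_rate * stft_window_ms / 1000)  # 1102 samples
hop_samples = sample_rate * stft_shift_ms / 1000          # 275.625 samples

print(win_samples)  # 1102 -- matches num_fft = 1102 in the Tacotron config
print(hop_samples)  # 275.625 -- vs. DiffWave's hop_samples=256 and n_fft=1024
```

So the window size explains where num_fft = 1102 comes from, and the 12.5 ms shift is roughly a 275/276-sample hop, which does not line up with the vocoder's hop_samples=256. To match them you would either retrain the vocoder with the TTS hop/window, or change the TTS to a 1024-sample window and 256-sample hop (about 46.4 ms and 11.6 ms at 22050 Hz).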
Issue Analytics
- Created: 2 years ago
- Comments: 10 (5 by maintainers)
Top GitHub Comments
This happens because the spectrogram frames are upsampled by a factor of 256, and not 275 (your new hop size). Here’s how you can change the upsampling module to go up to a factor of 275:
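The code snippet from that comment is not preserved in this excerpt. As a hedged sketch of the idea (assuming an lmnt-com/diffwave-style SpectrogramUpsampler built from two ConvTranspose2d layers whose time-axis strides multiply to the hop size, 16 × 16 = 256 in stock DiffWave), one possible factorization for a hop of 275 is 11 × 25. The kernel/padding choices here are illustrative, picked so that each layer upsamples by exactly its stride:

```python
# Hypothetical sketch: upsample mel frames along time by 11 * 25 = 275
# instead of DiffWave's stock 16 * 16 = 256.
# For odd stride s, kernel 2s+1 with padding (s+1)//2 yields exactly s*L
# output frames: out = (L-1)*s - 2*((s+1)//2) + (2s+1) = s*L.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramUpsampler(nn.Module):
    def __init__(self):
        super().__init__()
        # time strides 11 and 25 -> total upsampling factor 11 * 25 = 275
        self.conv1 = nn.ConvTranspose2d(1, 1, [3, 23], stride=[1, 11], padding=[1, 6])
        self.conv2 = nn.ConvTranspose2d(1, 1, [3, 51], stride=[1, 25], padding=[1, 13])

    def forward(self, x):
        x = torch.unsqueeze(x, 1)            # [B, 1, n_mels, frames]
        x = F.leaky_relu(self.conv1(x), 0.4)
        x = F.leaky_relu(self.conv2(x), 0.4)
        return torch.squeeze(x, 1)           # [B, n_mels, frames * 275]

# quick shape check: 10 mel frames -> 2750 audio-rate positions
y = SpectrogramUpsampler()(torch.zeros(1, 80, 10))
assert tuple(y.shape) == (1, 80, 2750)
```

Note that with the Tacotron settings above the exact hop is 275.625 samples (12.5 ms at 22050 Hz), which is not an integer; in practice people round the hop to a nearby integer (e.g. 275 or 276) or pick ms values that divide the sample rate cleanly.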
Sounds like training is progressing as expected. The training loss for this generation of diffusion models has pretty high variance because of the noise-schedule sampling procedure, so don't let the fluctuation deter you. The model typically keeps improving even when the loss looks like it has flattened out.
Given that you’re training a multi-speaker model, I recommend training on all speakers for a large number of iterations/epochs, and then fine-tuning on individual speakers if the multi-speaker model isn’t good enough.