Examples of speech tensor shape for gst models?
I’m trying to figure out how to use models from here in the espnet2 colab demo: https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb
I’m using these values for tag, vocoder_tag, etc.:
fs, lang = 24000, "English"
tag = "kan-bayashi/vctk_tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.best"
vocoder_tag = "ljspeech_multi_band_melgan.v2"
And I’m trying to get it to run by setting a random value for speech:
x = "This is my favorite sentence!"
speech = torch.randn(512, 80) # this is wrong
# synthesis
with torch.no_grad():
    start = time.time()
    wav, c, *_ = text2speech(x, speech=speech)
    wav = vocoder.inference(c)
rtf = (time.time() - start) / (len(wav) / fs)
print(f"RTF = {rtf:5f}")
# let us listen to generated samples
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=fs))
But I get the error:
RuntimeError: Padding size should be less than the corresponding input dimension, but got: padding (1024, 1024) at dimension 2 of input (1,.,.) = ...
I feel like this is because the shape of speech is wrong. Are there any examples of how to compute a proper value for it?
I see in the code it’s expecting shape (Lmax, idim). I can see that odim is 80 when I look at the value of text2speech.tts.odim.
On the other hand, perhaps I’ve misunderstood something obvious about how to approach this. 😃
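A note on the error itself: padding of (1024, 1024) is what a centered STFT with n_fft = 2048 would apply, reflecting n_fft // 2 samples on each side of the last dimension. If that reading is right, the model is extracting features from a raw waveform, and the last dimension of a (512, 80) tensor (80) is far too short to pad. The check below is a minimal sketch of that length rule in plain Python. The n_fft value is inferred from the error message, and the helper name is hypothetical, not part of ESPnet.

```python
def reflect_pad_fits(last_dim: int, n_fft: int = 2048) -> bool:
    """Return True if a centered STFT could reflect-pad an input of this length.

    Reflect padding requires the pad size (n_fft // 2 per side) to be
    strictly smaller than the length of the dimension being padded.
    """
    pad = n_fft // 2  # 1024 for the assumed n_fft of 2048
    return pad < last_dim


# torch.randn(512, 80): the last dimension is 80, so padding by 1024 fails.
print(reflect_pad_fits(80))
# A 1-D waveform of 50000 samples is long enough to pad.
print(reflect_pad_fits(50000))
```

Under this assumption, speech would need to be a 1-D waveform tensor of shape (num_samples,) rather than a (Lmax, 80) feature matrix.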
Issue Analytics
- State:
- Created 3 years ago
- Comments:10 (6 by maintainers)
Top GitHub Comments
I added multi-speaker example in the notebook. Please enjoy! https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb
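For the GST model in the question, one way to get a reference for speech is to load a short recording instead of random features. The sketch below reads a mono 16-bit WAV file with the Python standard library only. It assumes the model accepts a raw 1-D waveform at the model's sampling rate (24000 Hz in the question); the helper name and the file name in the usage comment are hypothetical.

```python
import array
import sys
import wave
from typing import List, Tuple


def read_reference_wav(path: str) -> Tuple[List[float], int]:
    """Read a mono 16-bit PCM WAV file as floats in [-1, 1) plus its sample rate."""
    with wave.open(path, "rb") as f:
        if f.getnchannels() != 1 or f.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        rate = f.getframerate()
        pcm = array.array("h")
        pcm.frombytes(f.readframes(f.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV samples are stored little-endian
    # Scale 16-bit integers to floating-point amplitudes.
    return [s / 32768.0 for s in pcm], rate


# Hypothetical usage with the notebook's objects (not run here):
#   samples, rate = read_reference_wav("reference.wav")  # rate should match fs
#   speech = torch.tensor(samples)  # 1-D float tensor, shape (num_samples,)
#   wav, c, *_ = text2speech(x, speech=speech)
```

If the recording's sample rate differs from fs, it would need resampling before being passed to the model.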
Try https://github.com/espnet/espnet/tree/master/egs2/vctk/tts1 and see the usage at https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1