Examples of speech tensor shape for gst models?

I’m trying to figure out how to use models from here in the espnet2 colab demo: https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb

I’m using these values for tag, vocoder_tag, etc:

fs, lang = 24000, "English"
tag = "kan-bayashi/vctk_tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.best"
vocoder_tag = "ljspeech_multi_band_melgan.v2"

I'm trying to get it to run by passing a random value for speech:

import time
import torch

x = "This is my favorite sentence!"

speech = torch.randn(512, 80)  # placeholder reference speech; this shape is probably wrong

# synthesis
with torch.no_grad():
    start = time.time()
    wav, c, *_ = text2speech(x, speech=speech)
    wav = vocoder.inference(c)
rtf = (time.time() - start) / (len(wav) / fs)
print(f"RTF = {rtf:5f}")

# let us listen to generated samples
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=fs))

But I get the error:

RuntimeError: Padding size should be less than the corresponding input dimension, but got: padding (1024, 1024) at dimension 2 of input (1,.,.) = ...

I feel like this is because the shape of speech is wrong. Are there any examples of how to compute a proper value for it?

I see in the code that it's expecting shape (Lmax, idim), and I can see that odim is 80 when I look at the value of text2speech.tts.odim.

On the other hand, perhaps I’ve misunderstood something obvious about how to approach this. 😃
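
For what it's worth, the error message itself hints at what is going on. Reflect padding of 1024 samples per side only works if the padded dimension is longer than 1024, and neither dimension of a (512, 80) tensor is. The sketch below just checks that arithmetic in plain Python. It is not ESPnet code: the n_fft value of 2048 is an assumption inferred from the (1024, 1024) padding in the error, and the 50000-sample length is a hypothetical 1-D waveform used only for comparison. Whether speech should in fact be a raw waveform rather than a feature matrix is a reading of the error message that should be checked against the ESPnet documentation.

```python
# Assumption: a centered STFT pads n_fft // 2 samples on each side in reflect mode.
# N_FFT = 2048 is inferred from the (1024, 1024) padding in the error, not read from the model config.
N_FFT = 2048
PAD = N_FFT // 2


def reflect_pad_is_valid(dim_size: int, pad: int = PAD) -> bool:
    """Reflect padding mirrors existing samples, so it needs pad < dim_size."""
    return pad < dim_size


# Sizes to compare: both axes of the (512, 80) tensor from the question,
# plus a hypothetical 1-D waveform of 50000 samples (about 2 s at fs = 24000).
sizes = {
    "axis 0 of (512, 80)": 512,
    "axis 1 of (512, 80)": 80,
    "hypothetical waveform (50000,)": 50000,
}

results = {name: reflect_pad_is_valid(size) for name, size in sizes.items()}

for name, ok in results.items():
    status = "can be padded" if ok else "too short for the padding"
    print(f"{name}: {status} (pad = {PAD})")
```

Whichever axis of the (512, 80) tensor ends up being treated as time, it is shorter than the padding, which matches the RuntimeError. A sufficiently long 1-D tensor would at least get past this particular check.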

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (6 by maintainers)
