Examples of speech tensor shape for gst models?

I’m trying to figure out how to use models from here in the espnet2 colab demo: https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb

I’m using these values for tag, vocoder_tag, etc:

fs, lang = 24000, "English"
tag = "kan-bayashi/vctk_tts_train_gst_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.best"
vocoder_tag = "ljspeech_multi_band_melgan.v2"

I'm trying to get it to run by passing a random tensor for speech:

x = "This is my favorite sentence!"

speech = torch.randn(512, 80) # this is wrong

# synthesis
with torch.no_grad():
    start = time.time()
    wav, c, *_ = text2speech(x, speech=speech)
    wav = vocoder.inference(c)
rtf = (time.time() - start) / (len(wav) / fs)
print(f"RTF = {rtf:5f}")

# let us listen to generated samples
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=fs))

But I get the error:

RuntimeError: Padding size should be less than the corresponding input dimension, but got: padding (1024, 1024) at dimension 2 of input (1,.,.) = ...

I feel like this is because the shape of speech is wrong. Are there any examples of how to compute a proper value for it?

I see in the code that it expects shape (Lmax, idim). I can also see that odim is 80 when I look at the value of text2speech.tts.odim.

On the other hand, perhaps I’ve misunderstood something obvious about how to approach this. 😃

Top GitHub Comments

kan-bayashi commented, Sep 19, 2020

I added a multi-speaker example to the notebook. Please enjoy! https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb
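
One possible reading of the error in the question, offered as a sketch rather than a confirmed diagnosis: the message reports reflect padding of 1024 samples per side on the last axis of a 3-D input whose first axis is 1, which matches the (512, 80) tensor with a batch axis in front. PyTorch's reflect padding needs the padded axis to be strictly longer than the pad, and 80 is not. That would fit a model that computes its own features from a waveform (the "raw" in the tag hints at this, but it is an assumption). The helper name reflect_pad_fits, the default n_fft=2048 (inferred from the 1024 padding) and the 50000-sample length are all illustrative, not taken from ESPnet.

```python
def reflect_pad_fits(shape, n_fft=2048):
    """Check whether the last axis is long enough for centered reflect padding.

    Reflect padding of n_fft // 2 samples per side requires the padded axis
    to be strictly longer than the pad. n_fft=2048 is an assumption inferred
    from the (1024, 1024) padding reported in the error message.
    """
    pad = n_fft // 2
    return shape[-1] > pad


# Shape from the question with a leading batch axis, as the error reports.
mel_like = (1, 512, 80)
# Hypothetical 1-D waveform of 50000 samples (about 2 s at 24 kHz), batched.
wave_like = (1, 50000)

print(reflect_pad_fits(mel_like))   # False: last axis is 80, pad is 1024
print(reflect_pad_fits(wave_like))  # True: last axis is 50000, pad is 1024
```

If this reading is right, speech in the question's cell would need to be a 1-D tensor of waveform samples rather than an (Lmax, 80) feature matrix. The multi-speaker example in the linked notebook is the authoritative reference for the expected input.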

kan-bayashi commented, Sep 23, 2020

Try the recipe at https://github.com/espnet/espnet/tree/master/egs2/vctk/tts1 and see the usage notes at https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1
