[Bug] Truncated audio when generating speech


Describe the bug

When generating speech audio from the following text, the output file contains only truncated audio: the speech is cut off before the sentences have been fully pronounced.

Text:

Multiple debugging information entries may share the same abbreviation table entry. Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.

Generated speech audio: tts_output.zip

To Reproduce

$ tts --text 'Multiple debugging information entries may share the same abbreviation table entry. Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.'

Expected behavior

Generation of speech audio for the full input text.

Logs

$ tts --text 'Multiple debugging information entries may share the same abbreviation table entry. Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.'
 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: Multiple debugging information entries may share the same abbreviation table entry. Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.
 > Text splitted to sentences.
['Multiple debugging information entries may share the same abbreviation table entry.', 'Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.']
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 5.151856899261475
 > Real-time factor: 0.4295162001993176
 > Saving output to tts_output.wav

Environment

{
    "CUDA": {
        "GPU": [
            "Quadro RTX 5000 with Max-Q Design"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu102",
        "TTS": "0.6.2",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "",
        "python": "3.9.12",
        "version": "#1 SMP PREEMPT Debian 5.16.18-1 (2022-03-29)"
    }
}

Additional context

No response

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

koutheir commented, Apr 21, 2022 (3 reactions)

There is a limit on the sentence length for the default model.

  • What is that exact limit?
  • How can one find this information for a particular model?
  • How is the length calculated (character count, word count, etc.)?
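
One way to reason about these questions, using only the log from the report: the message "Decoder stopped with `max_decoder_steps` 500" suggests that the limit is counted in decoder steps (output spectrogram frames) rather than in characters or words. The sketch below assumes that each decoder step emits `r` mel frames and that each frame advances the audio by `hop_length` samples. That is a common Tacotron2 convention, but it is not confirmed anywhere in this thread. The helper name `max_audio_seconds` is hypothetical; the numbers are taken from the log above.

```python
def max_audio_seconds(max_decoder_steps: int, r: int,
                      hop_length: int, sample_rate: int) -> float:
    """Estimate the longest audio one sentence can yield before the decoder is cut off.

    Assumption (not confirmed by the thread): one decoder step produces `r`
    mel frames, and one mel frame corresponds to `hop_length` audio samples.
    """
    frames = max_decoder_steps * r
    return frames * hop_length / sample_rate


# Values from the log: max_decoder_steps=500, r=1, hop_length=256, sample_rate=22050
default_limit = max_audio_seconds(500, 1, 256, 22050)
# Same model settings with the larger limit suggested in the workaround
raised_limit = max_audio_seconds(5000, 1, 256, 22050)

print(f"default limit: {default_limit:.2f} s per sentence")
print(f"raised limit:  {raised_limit:.2f} s per sentence")
```

Under these assumptions the default works out to roughly 5.8 seconds of audio per sentence, which would explain why a long sentence is cut off while a short one is not.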

a-t-0 commented, Jul 9, 2022 (1 reaction)

I would also like to know the answers to the questions @koutheir asked 😃:

  • What is that exact limit?
  • How can one find this information for a particular model?
  • How is the length calculated (character count, word count, etc.)?

So it would be nice if the question could be re-opened.

FYI, here is a workaround given in this issue. Add:

"max_decoder_steps": 5000

at the end of the config file of the (default) model, in:

~/.local/share/tts/vocoder_models--en--ljspeech--hifigan_v2/config.json

Note: JSON requires a comma between consecutive properties. So add a comma to the end of the line that used to be last, and do not add a comma after your own "max_decoder_steps": 5000 entry. TL;DR: make the last lines look like:

    // PATHS
    "output_path": "/home/erogol/gdrive/Trainings/sam/",
    // Custom limit made larger
    "max_decoder_steps": 5000
}
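
The manual edit described above can also be scripted, which avoids the trailing-comma mistake. This is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes the config uses `//` comments as in the snippet above, which the standard `json` module cannot parse directly. It strips those comments first, so the rewritten text loses them. Note also that the log attributes the `max_decoder_steps` message to the Tacotron2 decoder, so the key may need to go into the TTS model's config rather than the vocoder's; the thread does not confirm this.

```python
import json
import re

# Matches either a JSON string literal or a `//` comment running to end of line.
# Matching strings first keeps `//` inside values (for example URLs) intact.
_STRING_OR_COMMENT = re.compile(r'"(?:\\.|[^"\\])*"|//[^\n]*')


def strip_line_comments(text: str) -> str:
    """Remove `//` comments while leaving string values untouched."""
    return _STRING_OR_COMMENT.sub(
        lambda m: m.group(0) if m.group(0).startswith('"') else "", text
    )


def with_max_decoder_steps(config_text: str, steps: int) -> str:
    """Return the config text with `max_decoder_steps` set to `steps`."""
    config = json.loads(strip_line_comments(config_text))
    config["max_decoder_steps"] = steps
    # json.dumps places the commas itself, so no manual comma handling is needed.
    return json.dumps(config, indent=4)


# Shortened stand-in for the end of a real config file
sample = '''{
    // PATHS
    "output_path": "/home/erogol/gdrive/Trainings/sam/"
}'''

updated = with_max_decoder_steps(sample, 5000)
print(updated)
```

To apply it to a real file, read the config into a string, pass it through `with_max_decoder_steps`, and write the result back. Keeping a backup copy of the original config first is advisable, since the comments are dropped.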
