[Feature request] [TTS] Support SSML in input text

Is your feature request related to a problem? Please describe.

For TTS, there is a need to choose a specific model or send additional data to the engine on how to handle a part of the text. Examples:

  • rendering a dialog with different voices
  • rendering a part with a specific emotion (joy, fear, sadness, surprise…)

Describe the solution you’d like

Support SSML / coqui markup in input text. Example:

- <tts model="male_voice_1"> Check the box under the tree </tts>
- <tts model="child_voice_1"> This one?   <tts emotion="joy">Wow, it's the Harry Potter Lego!</tts>  </tts>

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments:30 (8 by maintainers)

Top GitHub Comments

8 reactions
synesthesiam commented, Sep 28, 2021

@erogol It might be worth moving this to a discussion

I’ve completed my first prototype of 🐸 TTS with SSML support (currently here)! I’m using a gruut side branch for now (supported SSML tags).

Now something like this works:

SSML=$(cat << EOF
<speak>
  <s lang="en">123</s>
  <s lang="de">123</s>
  <s lang="es">123</s>
  <s lang="fr">123</s>
  <s lang="nl">123</s>
</speak>
EOF
)

python3 TTS/bin/synthesize.py \
    --model_name tts_models/en/ljspeech/tacotron2-DDC_ph \
    --extra_model_name tts_models/de/thorsten/tacotron2-DCA \
    --extra_model_name tts_models/es/mai/tacotron2-DDC \
    --extra_model_name tts_models/fr/mai/tacotron2-DDC \
    --extra_model_name tts_models/nl/mai/tacotron2-DDC \
    --text "$SSML" --ssml true --out_path ssml.wav

Which outputs a WAV file with:

  • “one hundred and twenty three” in English
  • “einhundertdreiundzwanzig” in German
  • “ciento veintitrés” in Spanish
  • “cent vingt trois” in French, and
  • “honderddrieëntwintig” in Dutch

Before getting any deeper, I wanted to see if I’m on the right track.

The three main changes I’ve made are:

  1. Support for multiple TTS models/SSML input in the Synthesizer
  2. Ability to load additional TTS models when running the server.py and synthesize.py scripts (--extra_model_name)
  3. Changes to the web UI and API to support SSML and TTS model selection

Synthesizer

I created a VoiceConfig class that holds the TTS/vocoder models and configs. When creating a Synthesizer, there is now an extra_voices argument that accepts a list of VoiceConfig objects to load in addition to the “default” voice.

The Synthesizer.tts method now operates in two modes: when the ssml argument is True, it uses gruut to partially parse and split the SSML into multiple sentence objects. Each sentence object is synthesized with the correct TTS model, referenced in one of two ways:

  • By voice name, such as <voice name="tts_models/en/ljspeech/tacotron2-DDC">...</voice>
    • For multi-speaker models, the format name#speaker_idx is used (e.g., tts_models/en/vctk/vits#p228)
  • By language, such as <s lang="de">...</s>

If no voice or language is specified, the default voice is used.
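
The lookup order described above (voice name first, then language, then the default voice) can be sketched as follows. This is a minimal illustration and not the prototype's code: `VoiceRef`, `split_voice_name`, and `resolve_voice` are hypothetical names, and falling through to the language match when a voice name is unknown is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class VoiceRef:
    """Hypothetical stand-in for a loaded voice (not the prototype's VoiceConfig)."""
    model_name: str
    language: str


def split_voice_name(name: str) -> Tuple[str, Optional[str]]:
    """Split 'name#speaker_idx' into (name, speaker_idx); speaker is None if absent."""
    model, sep, speaker = name.partition("#")
    return model, (speaker if sep else None)


def resolve_voice(
    voices: Dict[str, VoiceRef],
    default: VoiceRef,
    voice_name: Optional[str] = None,
    lang: Optional[str] = None,
) -> Tuple[VoiceRef, Optional[str]]:
    """Return (voice, speaker_idx) for one sentence of the SSML input."""
    # 1. An explicit <voice name="..."> is tried first.
    if voice_name:
        model, speaker = split_voice_name(voice_name)
        if model in voices:
            return voices[model], speaker
        # Assumption: an unknown voice name falls through to the next rules.
    # 2. Otherwise match on the language from <s lang="...">.
    if lang:
        for voice in voices.values():
            if voice.language == lang:
                return voice, None
    # 3. Nothing specified or matched: use the default voice.
    return default, None
```

Keeping the speaker index separate from the model name means a multi-speaker model only needs to be loaded once, whichever speaker a sentence asks for.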

Command-Line

The server.py and synthesize.py scripts now accept a --extra_model_name argument, which is used to load additional voices by model name:

python3 TTS/server/server.py \
    --model_name tts_models/en/ljspeech/tacotron2-DDC_ph \
    --extra_model_name tts_models/de/thorsten/tacotron2-DCA \
    --extra_model_name tts_models/en/vctk/vits

The default voice is specified as normal (with --model_name or --model_path). All of the extra voices can (currently) only be loaded by name with their default vocoders.
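
A repeatable flag like `--extra_model_name` is commonly built with argparse's `append` action. The sketch below shows that pattern under stated assumptions: `build_parser` and `str2bool` are illustrative helpers, and the actual argument definitions in the prototype may differ.

```python
import argparse


def str2bool(value: str) -> bool:
    """Interpret strings such as 'true' or '1' as True (illustrative helper)."""
    return value.strip().lower() in ("true", "1", "yes")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the multi-voice flags")
    # The default voice is given once.
    parser.add_argument("--model_name", type=str, default=None)
    # action="append" collects every occurrence of the flag into a list.
    parser.add_argument("--extra_model_name", action="append", default=[])
    # Takes an explicit value, matching the `--ssml true` form used above.
    parser.add_argument("--ssml", type=str2bool, default=False)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.model_name, args.extra_model_name, args.ssml)
```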

Additionally, the synthesize.py script accepts a --ssml true argument to tell 🐸 TTS that the input text is SSML.

Web UI

[Screenshot: coqui-ssml]

The two web UI changes are:

  • SSML checkbox that adds ssml=true to GET variables
  • Ability to select different voices (only shown if more than one TTS model is loaded)
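
For illustration, a request with the SSML checkbox ticked might be built as below. Only the `ssml=true` GET variable comes from the description above; the `/api/tts` path, the `text` parameter name, and the base URL are assumptions about the server, and the voice selection parameter is left out because its name is not given.

```python
from urllib.parse import urlencode


def build_tts_url(base_url: str, text: str, ssml: bool = False) -> str:
    """Build a GET URL for a TTS request (path and parameter names are assumed)."""
    params = {"text": text}
    if ssml:
        # Mirrors the checkbox: the variable is only added when SSML is enabled.
        params["ssml"] = "true"
    # urlencode percent-escapes the angle brackets and slashes in the SSML.
    return f"{base_url}/api/tts?{urlencode(params)}"


if __name__ == "__main__":
    print(build_tts_url("http://localhost:5002", '<speak><s lang="de">123</s></speak>', ssml=True))
```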

8 reactions
synesthesiam commented, Sep 19, 2021

Small update: I’ve got preliminary SSML functionality in a side branch of gruut now with support for:

  • <speak>, <p>, <s>, and <w> tags (allowing for manual tokenization)
  • <say-as> with support for numbers (cardinal/ordinal/year/digits), dates, currency, and spell-out
  • <voice> (currently just name)

Numbers, dates, currency, and initialisms are automatically detected and verbalized. I’ve gone the extra mile and made full use of the lang attribute, so you can have:

<speak>
  <w lang="en_US">1</w> 
  <w lang="es_ES">1</w>
</speak>

verbalized as “one uno”. This works for dates, etc. as well, and can even generate phonemes from different languages in the same document. I imagine this could be used in 🐸 TTS with <voice> to generate multi-lingual utterances.
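
The per-word language handling can be illustrated with the standard library alone. This sketch does not use gruut and is not how gruut is implemented: `NUMBER_WORDS` is a tiny hypothetical lookup table that stands in for a real number verbalizer, and `verbalize` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table; a real verbalizer would cover all numbers,
# dates, and currencies for each language.
NUMBER_WORDS = {
    "en_US": {"1": "one"},
    "es_ES": {"1": "uno"},
}


def verbalize(ssml: str, default_lang: str = "en_US") -> str:
    """Verbalize each <w> element using the language from its lang attribute."""
    root = ET.fromstring(ssml)
    words = []
    for w in root.iter("w"):
        lang = w.get("lang", default_lang)
        token = (w.text or "").strip()
        # Tokens without a known verbalization are passed through unchanged.
        words.append(NUMBER_WORDS.get(lang, {}).get(token, token))
    return " ".join(words)


if __name__ == "__main__":
    doc = '<speak><w lang="en_US">1</w> <w lang="es_ES">1</w></speak>'
    print(verbalize(doc))
```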

The biggest thing that would help me in completing this feature is deciding on the default settings for non-English languages:

  • Default date format - can be any combination of day/month/year, where day can be either cardinal (“one”) or ordinal (“first”)
  • Default currency - which currency symbol/name (e.g., “$” / “USD”)
  • Default punctuation - what set of characters/strings should (by default) break apart sentences, phrases, and words (e.g., “ninety-nine” -> “ninety”, “nine”)
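
One way to picture these per-language defaults is as a small settings record. The sketch below is purely illustrative: `LanguageDefaults`, its field names, the `"mdy"` encoding, and the example English values are all hypothetical and are not gruut's configuration format.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LanguageDefaults:
    """Hypothetical per-language settings mirroring the three questions above."""
    date_format: str              # order of day/month/year, e.g. "mdy"
    ordinal_day: bool             # True: "first", False: "one"
    currency: str                 # default currency symbol or name
    word_breaks: Tuple[str, ...]  # strings that split a token into words


def split_words(text: str, defaults: LanguageDefaults) -> List[str]:
    """Split on whitespace, then on the language's word-break strings."""
    if not defaults.word_breaks:
        return text.split()
    pattern = "|".join(re.escape(b) for b in defaults.word_breaks)
    return [
        part
        for chunk in text.split()
        for part in re.split(pattern, chunk)
        if part
    ]


# Example values for English, chosen only for illustration.
EN_DEFAULTS = LanguageDefaults(
    date_format="mdy", ordinal_day=True, currency="USD", word_breaks=("-",)
)
```

With these example values, `split_words("ninety-nine", EN_DEFAULTS)` breaks the token at the hyphen, matching the "ninety", "nine" example above.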