[Feature request] [TTS] Support SSML in input text

Is your feature request related to a problem? Please describe.

For TTS, there is a need to choose a specific model or send additional data to the engine on how to handle a part of the text. Examples:

rendering a dialog with different voices
rendering a part with a specific emotion (joy, fear, sadness, surprise…)

Describe the solution you’d like

Support SSML / coqui markup in input text. Example:

- <tts model="male_voice_1"> Check the box under the tree </tts>
- <tts model="child_voice_1"> This one?   <tts emotion="joy">Wow, it's the Harry Potter lego!</tts>  </tts>
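
One way to read the nested markup above is that an inner tag inherits the attributes of its enclosing tag, so the "joy" line is still spoken by the child voice. The following is a minimal sketch of that interpretation using only the Python standard library. The function name `flatten_markup` and the inheritance rule are assumptions made for illustration, not part of 🐸 TTS.

```python
import xml.etree.ElementTree as ET


def flatten_markup(markup, defaults=None):
    """Return a list of (attributes, text) segments.

    Assumption: inner tags inherit the attributes of their parents.
    """
    # Wrap in a synthetic root so several top-level tags parse as one document.
    root = ET.fromstring(f"<root>{markup}</root>")
    segments = []

    def walk(element, inherited):
        attrs = {**inherited, **element.attrib}
        if element.text and element.text.strip():
            segments.append((attrs, element.text.strip()))
        for child in element:
            walk(child, attrs)
            # Text after a child's closing tag belongs to the parent again.
            if child.tail and child.tail.strip():
                segments.append((attrs, child.tail.strip()))

    walk(root, dict(defaults or {}))
    return segments


example = (
    '<tts model="male_voice_1">Check the box under the tree</tts>'
    '<tts model="child_voice_1">This one? '
    '<tts emotion="joy">Wow, it\'s the Harry Potter lego!</tts></tts>'
)

for attrs, text in flatten_markup(example):
    print(attrs, text)
```

Each segment could then be sent to whichever model and emotion setting its attributes name.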

Issue Analytics

Created: 2 years ago
Reactions: 4
Comments: 30 (8 by maintainers)

Top GitHub Comments

8 reactions

synesthesiam commented, Sep 28, 2021

@erogol It might be worth moving this to a discussion

I’ve completed my first prototype of 🐸 TTS with SSML support (currently here)! I’m using a gruut side branch for now (supported SSML tags).

Now something like this works:

SSML=$(cat << EOF
<speak>
  <s lang="en">123</s>
  <s lang="de">123</s>
  <s lang="es">123</s>
  <s lang="fr">123</s>
  <s lang="nl">123</s>
</speak>
EOF
)

python3 TTS/bin/synthesize.py \
    --model_name tts_models/en/ljspeech/tacotron2-DDC_ph \
    --extra_model_name tts_models/de/thorsten/tacotron2-DCA \
    --extra_model_name tts_models/es/mai/tacotron2-DDC \
    --extra_model_name tts_models/fr/mai/tacotron2-DDC \
    --extra_model_name tts_models/nl/mai/tacotron2-DDC \
    --text "$SSML" --ssml true --out_path ssml.wav

Which outputs a WAV file with:

“one hundred and twenty three” in English
“einhundertdreiundzwanzig” in German
“ciento veintitrés” in Spanish
“cent vingt trois” in French, and
“honderddrieëntwintig” in Dutch

Before getting any deeper, I wanted to see if I’m on the right track.

The three main changes I’ve made are:

Support for multiple TTS models/SSML input in the Synthesizer
Ability to load additional TTS models when running the server.py and synthesize.py scripts (--extra_model_name)
Changes to the web UI and API to support SSML and TTS model selection

Synthesizer

I created a VoiceConfig class that holds the TTS/vocoder models and configs. When creating a Synthesizer, there is now an extra_voices argument that accepts a list of VoiceConfig objects to load in addition to the “default” voice.

The Synthesizer.tts method now operates in two modes: when the ssml argument is True, it uses gruut to partially parse and split the SSML into multiple sentence objects. Each sentence object is synthesized with the correct TTS model, referenced in one of two ways:

By voice name, such as <voice name="tts_models/en/ljspeech/tacotron2-DDC">...</voice>
- For multi-speaker models, the format name#speaker_idx is used (e.g., tts_models/en/vctk/vits#p228)
By language, such as <s lang="de">...</s>

If no voice or language is specified, the default voice is used.
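
The selection rules above can be sketched roughly as follows. This is not the prototype's code: the names `resolve_voice`, `plan_sentences`, and `VOICES_BY_LANG` are hypothetical, the SSML is parsed with the standard library rather than gruut, and giving an explicit `<voice>` priority over `lang` is an assumption.

```python
import xml.etree.ElementTree as ET

DEFAULT_VOICE = "tts_models/en/ljspeech/tacotron2-DDC_ph"
# Hypothetical language -> model table, built from the loaded extra voices.
VOICES_BY_LANG = {"de": "tts_models/de/thorsten/tacotron2-DCA"}


def resolve_voice(voice_name=None, lang=None):
    """Return (model_name, speaker_idx) for one sentence."""
    if voice_name:
        # "name#speaker_idx" selects a speaker in a multi-speaker model.
        model_name, _, speaker_idx = voice_name.partition("#")
        return model_name, speaker_idx or None
    if lang and lang in VOICES_BY_LANG:
        return VOICES_BY_LANG[lang], None
    # Neither a voice nor a known language: fall back to the default voice.
    return DEFAULT_VOICE, None


def plan_sentences(ssml):
    """Return a list of (model_name, speaker_idx, text), one per <s> element."""
    root = ET.fromstring(ssml)
    plan = []

    def walk(element, voice_name):
        if element.tag == "voice":
            voice_name = element.get("name", voice_name)
        if element.tag == "s":
            model, speaker = resolve_voice(voice_name, element.get("lang"))
            plan.append((model, speaker, "".join(element.itertext()).strip()))
            return
        for child in element:
            walk(child, voice_name)

    walk(root, None)
    return plan


ssml = (
    "<speak>"
    '<s lang="de">123</s>'
    '<voice name="tts_models/en/vctk/vits#p228"><s>Hello</s></voice>'
    "<s>Goodbye</s>"
    "</speak>"
)
for row in plan_sentences(ssml):
    print(row)
```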

Command-Line

The server.py and synthesize.py scripts now accept a --extra_model_name argument, which is used to load additional voices by model name:

python3 TTS/server/server.py \
    --model_name tts_models/en/ljspeech/tacotron2-DDC_ph \
    --extra_model_name tts_models/de/thorsten/tacotron2-DCA \
    --extra_model_name tts_models/en/vctk/vits

The default voice is specified as normal (with --model_name or --model_path). All of the extra voices can (currently) only be loaded by name with their default vocoders.

Additionally, the synthesize.py script accepts a --ssml true argument to tell 🐸 TTS that the input text is SSML.
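
For readers wondering how a repeatable flag like --extra_model_name could be declared, here is a small argparse sketch. It only mirrors the flags mentioned in this comment. The parser layout, help strings, and the `parse_bool` helper are assumptions, not the actual synthesize.py code.

```python
import argparse


def parse_bool(value):
    """Interpret strings such as "true" or "false" passed on the command line."""
    return value.lower() in ("true", "1", "yes")


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the multi-voice flags")
    parser.add_argument("--model_name", default=None, help="default voice")
    # action="append" collects every occurrence of the flag into one list.
    parser.add_argument(
        "--extra_model_name",
        action="append",
        default=[],
        help="additional voice to load (may be repeated)",
    )
    parser.add_argument("--ssml", type=parse_bool, default=False,
                        help="treat --text as SSML")
    parser.add_argument("--text", default="")
    return parser


args = build_parser().parse_args([
    "--model_name", "tts_models/en/ljspeech/tacotron2-DDC_ph",
    "--extra_model_name", "tts_models/de/thorsten/tacotron2-DCA",
    "--extra_model_name", "tts_models/en/vctk/vits",
    "--text", "<speak><s>Hello</s></speak>",
    "--ssml", "true",
])
print(args.extra_model_name, args.ssml)
```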

Web UI

[Screenshot: coqui-ssml]

The two web UI changes are:

SSML checkbox that adds ssml=true to GET variables
Ability to select different voices (only shown if more than one TTS model is loaded)
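
As an illustration of the GET variables involved, the sketch below builds a request URL with ssml=true. Only the `ssml=true` variable comes from the description above. The base URL, the `text` parameter, and especially the `voice` parameter name are assumptions, so check the server code before relying on them.

```python
from urllib.parse import urlencode


def build_tts_url(base_url, text, ssml=False, voice=None):
    """Build a GET URL for a TTS request (parameter names partly assumed)."""
    params = {"text": text}
    if ssml:
        params["ssml"] = "true"
    if voice:
        # Hypothetical parameter name for the voice selected in the web UI.
        params["voice"] = voice
    return f"{base_url}?{urlencode(params)}"


url = build_tts_url(
    "http://localhost:5002/api/tts",  # assumed host, port, and path
    '<speak><s lang="de">123</s></speak>',
    ssml=True,
)
print(url)
```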

8 reactions

synesthesiam commented, Sep 19, 2021

Small update: I’ve got preliminary SSML functionality in a side branch of gruut now with support for:

<speak>, <p>, <s>, and <w> tags (allowing for manual tokenization)
<say-as> with support for numbers (cardinal/ordinal/year/digits), dates, currency, and spell-out
<voice> (currently just name)

Numbers, dates, currency, and initialisms are automatically detected and verbalized. I’ve gone the extra mile and made full use of the lang attribute, so you can have:

<speak>
  <w lang="en_US">1</w> 
  <w lang="es_ES">1</w>
</speak>

verbalized as “one uno”. This works for dates, etc. as well, and can even generate phonemes from different languages in the same document. I imagine this could be used in 🐸 TTS with <voice> to generate multi-lingual utterances.
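
To make the per-word lang behaviour concrete, here is a toy sketch that reproduces the “one uno” result. It does not use gruut. The lookup table and the `verbalize` function are invented for illustration and cover only a couple of numbers.

```python
import xml.etree.ElementTree as ET

# Toy lookup tables; a real system would use a full number verbalizer.
NUMBER_WORDS = {
    "en_US": {"1": "one", "2": "two"},
    "es_ES": {"1": "uno", "2": "dos"},
}


def verbalize(ssml, default_lang="en_US"):
    """Verbalize each <w> token using the table for its lang attribute."""
    root = ET.fromstring(ssml)
    words = []
    for w in root.iter("w"):
        lang = w.get("lang", default_lang)
        token = (w.text or "").strip()
        # Unknown tokens or languages are passed through unchanged.
        words.append(NUMBER_WORDS.get(lang, {}).get(token, token))
    return " ".join(words)


print(verbalize('<speak><w lang="en_US">1</w> <w lang="es_ES">1</w></speak>'))
# prints: one uno
```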

The biggest thing that would help me in completing this feature is deciding on the default settings for non-English languages:

Default date format - can be any combination of day/month/year, where day can be either cardinal (“one”) or ordinal (“first”)
Default currency - which currency symbol/name (e.g., “$” / “USD”)
Default punctuation - what set of characters/strings should (by default) break apart sentences, phrases, and words (e.g., “ninety-nine” -> “ninety”, “nine”)
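
The punctuation question can be illustrated with a small sketch in which the set of word-breaking strings is a per-language setting. The function name `split_words` and its default are assumptions for illustration only.

```python
import re


def split_words(text, word_breaks=("-",)):
    """Split text on whitespace and on any of the configured break strings."""
    # Longest strings first so multi-character breaks win over their prefixes.
    alternatives = [re.escape(b) for b in sorted(word_breaks, key=len, reverse=True)]
    alternatives.append(r"\s")
    pattern = "(?:" + "|".join(alternatives) + ")+"
    return [part for part in re.split(pattern, text) if part]


print(split_words("ninety-nine"))                  # hyphen breaks words
print(split_words("ninety-nine", word_breaks=()))  # hyphen kept inside the word
```

A language whose defaults omit the hyphen would keep "ninety-nine" as one token, which is the kind of choice the list above asks about.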
