[Feature request] [TTS] Support SSML in input text
Is your feature request related to a problem? Please describe.
For TTS, there is a need to choose a specific model or send additional data to the engine on how to handle a part of the text. Examples:
- rendering a dialog with different voices
- rendering a part with a specific emotion (joy, fear, sadness, surprise…)
Describe the solution you’d like
Support SSML / coqui markup in the input text. Example:
- `<tts model="male_voice_1">Check the box under the tree</tts>`
- `<tts model="child_voice_1">This one? <tts emotion="joy">Wow, it's the Harry Potter lego!</tts></tts>`
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 4
- Comments: 30 (8 by maintainers)
Top GitHub Comments
@erogol It might be worth moving this to a discussion
I’ve completed my first prototype of 🐸 TTS with SSML support (currently here)! I’m using a gruut side branch for now (supported SSML tags).
Now something like this works:
Which outputs a WAV file with:
Before getting any deeper, I wanted to see if I’m on the right track.
The three main changes I’ve made are:
- `Synthesizer`
- `server.py` and `synthesize.py` scripts (`--extra_model_name`)
- Web UI

Synthesizer
I created a `VoiceConfig` class that holds the TTS/vocoder models and configs. When creating a `Synthesizer`, there is now an `extra_voices` argument that accepts a list of `VoiceConfig` objects to load in addition to the “default” voice.

The `Synthesizer.tts` method now operates in two modes: when the `ssml` argument is `True`, it uses gruut to partially parse and split the SSML into multiple sentence objects. Each sentence object is synthesized with the correct TTS model, referenced in one of two ways:
- `<voice name="tts_models/en/ljspeech/tacotron2-DDC">...</voice>` (for multi-speaker models, `name#speaker_idx` is used, e.g. `tts_models/en/vctk/vits#p228`)
- `<s lang="de">...</s>` (a `lang` attribute selects the voice for that language)

If no voice or language is specified, the default voice is used.
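As a toy illustration of the “partially parse and split” step, the sketch below walks an SSML document and emits `(voice, lang, text)` chunks that could each be handed to a different model. This is not the prototype’s actual code (which uses gruut); it only mimics the idea with the standard library, and the model names are the ones mentioned in this thread.

```python
# Walk an SSML tree and collect (voice, lang, text) chunks.
# NOT Coqui TTS / gruut code -- a standard-library sketch of the idea.
import xml.etree.ElementTree as ET

DEFAULT_VOICE = "tts_models/en/ljspeech/tacotron2-DDC"

def split_ssml(ssml: str, default_voice: str = DEFAULT_VOICE):
    """Return (voice, lang, text) tuples for each text chunk in the SSML."""
    root = ET.fromstring(ssml)
    chunks = []

    def walk(elem, voice, lang):
        if elem.tag == "voice":
            voice = elem.get("name", voice)  # <voice name="..."> switches models
        lang = elem.get("lang", lang)        # lang attributes are inherited
        if elem.text and elem.text.strip():
            chunks.append((voice, lang, elem.text.strip()))
        for child in elem:
            walk(child, voice, lang)
            if child.tail and child.tail.strip():
                chunks.append((voice, lang, child.tail.strip()))

    walk(root, default_voice, "en")
    return chunks

ssml = (
    "<speak>"
    '<voice name="tts_models/en/vctk/vits#p228">'
    "<s>Check the box under the tree.</s></voice>"
    '<s lang="de">Hallo Welt.</s>'
    "</speak>"
)
for voice, lang, text in split_ssml(ssml):
    print(voice, lang, text)
```

Each chunk would then be synthesized by the matching `VoiceConfig`, with `name#speaker_idx` selecting a speaker within a multi-speaker model.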
Command-Line
The `server.py` and `synthesize.py` scripts now accept an `--extra_model_name` argument, which is used to load additional voices by model name. The default voice is specified as normal (with `--model_name` or `--model_path`). All of the extra voices can (currently) only be loaded by name, with their default vocoders.

Additionally, the `synthesize.py` script accepts a `--ssml true` argument to tell 🐸 TTS that the input text is SSML.
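A hypothetical invocation combining these flags might look like the following. The `--extra_model_name` and `--ssml` flags are the prototype’s (branch-only, not released behavior), and the model names are the examples from this thread.

```shell
# Sketch only: load a default voice plus one extra voice, then synthesize SSML.
python TTS/bin/synthesize.py \
    --model_name "tts_models/en/ljspeech/tacotron2-DDC" \
    --extra_model_name "tts_models/en/vctk/vits" \
    --ssml true \
    --text '<speak><voice name="tts_models/en/vctk/vits#p228">Hello!</voice></speak>' \
    --out_path output.wav
```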
Web UI
The two web UI changes are:
- `ssml=true` added to the `GET` variables

Small update: I’ve got preliminary SSML functionality in a side branch of gruut now, with support for:
- `<speak>`, `<p>`, `<s>`, and `<w>` tags (allowing for manual tokenization)
- `<say-as>` with support for numbers (cardinal/ordinal/year/digits), dates, currency, and `spell-out`
- `<voice>` (currently just `name`)

Numbers, dates, currency, and initialisms are automatically detected and verbalized. I’ve gone the extra mile and made full use of the `lang` attribute, so you can have mixed-language input verbalized as “one uno”. This works for dates, etc. as well, and can even generate phonemes from different languages in the same document. I imagine this could be used in 🐸 TTS with `<voice>` to generate multi-lingual utterances.

The biggest thing that would help me in completing this feature is deciding on the default settings for non-English languages: