[Bug] Document proper handling of digraphs
🐛 Description
I am trying to run 🐸TTS on new languages and came across a bug - or at least something that I think could be improved in the documentation for new languages, unless I missed something, in which case please point it out to me!
The language I am working with has digraphs in its character set - that is, `ɡʷ` is a separate character from `ɡ`. In `TTS.tts.utils.text.text_to_sequence`, the raw text is transformed into a sequence of character indices, but the function that turns the cleaned text into the sequence (`TTS.tts.utils.text._symbols_to_sequence`) just iterates through the string, so when processing `ɡʷ`, the index for `ɡ` is returned and `ʷ` is discarded by `_should_keep_symbol`. It appears there is a way to handle ARPAbet digraphs using curly braces. Another way of handling this could be to reverse-sort the list of characters by length, tokenize the raw text according to that sorted list, and pass the tokenized list to `TTS.tts.utils.text._symbols_to_sequence`. This is what I'm currently doing in a custom cleaner, but is this the "correct" or intended way of handling it? I would also love it if the TensorBoard log tracked text, for example by logging a comparison of the raw text with the text reconstructed from the sequence, as shown in the unit test below. That would have saved me training a few models and only figuring out the bug by listening to the audio.
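To make the reverse-sorted tokenization concrete, here is a minimal, hypothetical sketch of what I mean (my own illustration, not existing 🐸TTS code; the function name is made up):

```python
# Hypothetical illustration, not 🐸TTS code: greedy longest-match tokenization
# over a character set sorted by length, so a digraph like "ɡʷ" is matched
# before its single-codepoint prefix "ɡ".
def tokenize_longest_match(text, characters):
    """Split `text` into symbols from `characters`, preferring the longest match."""
    symbols = sorted(characters, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            i += 1  # no known symbol starts here; skip one codepoint
    return tokens

# tokenize_longest_match("ɡʷah", ["a", "h", "ɡ", "ɡʷ"]) == ["ɡʷ", "a", "h"]
```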
To Reproduce
```python
from unittest import TestCase

from TTS.tts.utils.text import text_to_sequence, sequence_to_text
from TTS.tts.utils.text.symbols import make_symbols


class TestMohawkCharacterInputs(TestCase):
    def setUp(self) -> None:
        self.mohawk_characters = {
            "pad": "_",
            "eos": "~",
            "bos": "^",
            "characters": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ', ' '], key=len, reverse=True),
            "punctuations": "!(),-.;? ",
            "phonemes": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ'], key=len, reverse=True),
            "unique": True,
        }
        self.custom_symbols = make_symbols(**self.mohawk_characters)[0]
        self.mohawk_test_text = ["ɡʷah"]
        self.cleaners = ["basic_cleaners"]

    def test_text_parity(self):
        for utt in self.mohawk_test_text:
            seq = text_to_sequence(utt, cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False)
            text = sequence_to_text(seq, tp=self.mohawk_characters, add_blank=False, custom_symbols=self.custom_symbols)
            self.assertEqual(text, utt)
```
Expected behavior
With the above unit test, `text_to_sequence("ɡʷah", cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False)` returns `[45, 26, 30]` when it should return `[24, 26, 30]`. It would be nice if digraphs were handled by default, but barring that, it should be documented somewhere (here?) how they should be handled, and ideally TensorBoard would perform some input-text sanity check, as sketched below.
Environment
```json
{
    "CUDA": {
        "GPU": [
            "Tesla V100-SXM2-16GB"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.10.1+cu102",
        "TTS": "0.5.0",
        "numpy": "1.21.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.8.12",
        "version": "#151-Ubuntu SMP Fri Jun 18 19:21:19 UTC 2021"
    }
}
```
Additional context
Many thanks to the authors for this excellent project!
Top GitHub Comments
Unfortunately I don’t think it would be so easy to learn from context. For starters, the languages I am working with are low-resource, and second, these digraphs are often phonemic, so minimal pairs exist: for example, `gi` and `gʷi` are both valid words that would be encoded the same by this model but have different pronunciations. I definitely think that if the model can’t handle digraphs yet, it would be a good idea to state that in the documentation section for adding a new language.
Here is the cleaner I wrote that solves the problem, but I haven’t fixed the other issue yet (https://github.com/coqui-ai/TTS/issues/1075). It’s not very DRY to have to include the characters from the configuration again, but I couldn’t see a simple way to pass the config to all places the cleaner is used. I will have a look at #937 and see if there would be a good way to integrate this functionality.
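(Illustrative sketch only, not the author's original cleaner: a regex-based cleaner that returns a list of symbols, longest first, so the per-element iteration in `_symbols_to_sequence` sees whole digraphs. The hard-coded symbol list stands in for the characters duplicated from the config, which is exactly the DRY concern mentioned.)

```python
import re

# Illustrative only; not the cleaner referenced above. The symbol list is
# hard-coded here, duplicating the config's character set (the DRY concern).
_MOHAWK_SYMBOLS = sorted(["a", "aː", "h", "ɡ", "ɡʷ", " "], key=len, reverse=True)
_SYMBOL_RE = re.compile("|".join(re.escape(s) for s in _MOHAWK_SYMBOLS))


def mohawk_cleaners(text):
    """Return the text as a list of symbols so digraphs like 'ɡʷ' stay intact."""
    return _SYMBOL_RE.findall(text)
```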
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.