[Bug] Document proper handling of digraphs

🐛 Description

I am trying to run 🐸TTS on new languages and came across a bug - or at least something that I think could be improved in the documentation for new languages, unless I missed something, in which case please point it out to me!

The language I am working with has digraphs in its character set - that is, ɡʷ is a separate character from ɡ. In TTS.tts.utils.text.text_to_sequence, the raw text is transformed into a sequence of character indices, but the function that turns the cleaned text into the sequence (TTS.tts.utils.text._symbols_to_sequence) just iterates through the string one character at a time. So when processing ɡʷ, the index for ɡ is returned and ʷ is discarded by _should_keep_symbol.

It appears there is a way to handle Arpabet digraphs using curly braces. Another way of handling this could be to sort the list of characters by length in descending order, tokenize the raw text according to that sorted list, and pass the tokenized list to TTS.tts.utils.text._symbols_to_sequence. This is what I’m currently doing in a custom cleaner, but is this the “correct” or intended way of handling this?

I would love it if the tensorboard log tracked text as well, for example by logging a comparison of the raw text with the text reconstructed from the sequence, as shown in the unittest below. That would have saved me from training a few models and only discovering the bug by listening to the audio.
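
The length-sorted tokenization described above can be sketched with the standard library alone. This is only an illustration under assumptions: the inventory is a hypothetical, abbreviated subset, and the helper names build_tokenizer and tokenize are made up for this example and are not part of 🐸TTS.

```python
import re

# Hypothetical, abbreviated inventory for illustration (not the full character set).
SYMBOLS = ["a", "h", "i", "k", "kʷ", "ɡ", "ɡʷ"]


def build_tokenizer(symbols):
    """Compile an alternation that tries longer symbols before their shorter prefixes."""
    ordered = sorted(symbols, key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in ordered))


def tokenize(text, pattern):
    """Split text into inventory symbols; anything outside the inventory is skipped."""
    return pattern.findall(text)


pattern = build_tokenizer(SYMBOLS)

# Walking the string one code point at a time loses the labialization mark.
naive = [ch for ch in "ɡʷah" if ch in SYMBOLS]
# Longest-match tokenization keeps the digraph as a single symbol.
tokens = tokenize("ɡʷah", pattern)

print(naive)   # ['ɡ', 'a', 'h']
print(tokens)  # ['ɡʷ', 'a', 'h']
```

Regex alternation in Python tries the branches from left to right, which is why the longer symbols have to come first in the pattern.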

To Reproduce

from unittest import TestCase

from TTS.tts.utils.text import text_to_sequence, sequence_to_text
from TTS.tts.utils.text.symbols import make_symbols

class TestMohawkCharacterInputs(TestCase):
    def setUp(self) -> None:
        self.mohawk_characters = {
        "pad": "_",
        "eos": "~",
        "bos": "^",
        "characters": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ', ' '],key=len,reverse=True),
        "punctuations": "!(),-.;? ",
        "phonemes": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ'],key=len,reverse=True),
        "unique": True
        }
        self.custom_symbols = make_symbols(**self.mohawk_characters)[0]
        self.mohawk_test_text = ["ɡʷah"]
        self.cleaners = ["basic_cleaners"]
        
    def test_text_parity(self):
        for utt in self.mohawk_test_text:
            seq = text_to_sequence(utt, cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False)
            text = sequence_to_text(seq, tp=self.mohawk_characters, add_blank=False, custom_symbols=self.custom_symbols)
            self.assertEqual(text, utt)

Expected behavior

With the above unittest, text_to_sequence("ɡʷah", cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False) returns [45, 26, 30] when it should return [24, 26, 30]. It would be nice if digraphs were handled by default, but barring that, it should be documented somewhere (here?) how they should be handled, and ideally tensorboard would perform some input text sanity check.
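
The kind of input text sanity check asked for here could look roughly like the following. It is a minimal sketch, not 🐸TTS code: the inventory, its indices, and every function name are hypothetical, and encode_per_character only imitates the per-character behaviour described in this issue.

```python
# Hypothetical, abbreviated inventory; the indices are illustrative only.
SYMBOLS = ["ɡʷ", "ɡ", "a", "h"]
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}
UTTERANCES = ["ɡʷah", "ah"]


def encode_per_character(text):
    """Imitate an encoder that walks the string one code point at a time."""
    return [SYMBOL_TO_ID[ch] for ch in text if ch in SYMBOL_TO_ID]


def decode(sequence):
    """Map a sequence of IDs back to text."""
    return "".join(SYMBOLS[i] for i in sequence)


def round_trip_mismatches(utterances, encode):
    """Return (raw, reconstructed) pairs whose round trip loses information."""
    mismatches = []
    for raw in utterances:
        reconstructed = decode(encode(raw))
        if reconstructed != raw:
            mismatches.append((raw, reconstructed))
    return mismatches


problems = round_trip_mismatches(UTTERANCES, encode_per_character)
for raw, reconstructed in problems:
    print(f"input {raw!r} was encoded as {reconstructed!r}")
```

Running a check like this over the training transcripts before training, or logging its output next to the audio samples, would surface a dropped symbol without having to listen for it.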

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla V100-SXM2-16GB"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.10.1+cu102",
        "TTS": "0.5.0",
        "numpy": "1.21.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.8.12",
        "version": "#151-Ubuntu SMP Fri Jun 18 19:21:19 UTC 2021"
    }
}

Additional context

Many thanks to the authors for this excellent project!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
roedoejet commented, Jan 7, 2022

Unfortunately I don’t think it would be so easy to learn from context. First, the languages I am working with are low-resource. Second, these digraphs are often phonemic, so minimal pairs exist: for example, gi and gʷi are both valid words that would be encoded the same by this model but have different pronunciations.

I definitely think that if the model can’t handle digraphs yet, it would be a good idea to state that in the documentation section on adding a new language.

Here is the cleaner I wrote that solves the problem, but I haven’t fixed the other issue yet (https://github.com/coqui-ai/TTS/issues/1075). It’s not very DRY to have to include the characters from the configuration again, but I couldn’t see a simple way to pass the config to all places the cleaner is used. I will have a look at #937 and see if there would be a good way to integrate this functionality.

import re

from nltk.tokenize import RegexpTokenizer

from TTS.tts.utils.text.symbols import make_symbols

def mohawk_cleaners(text):
    # Character inventory duplicated from the training config (the non-DRY part mentioned above).
    mohawk_characters = {
        "pad": "_",
        "eos": "~",
        "bos": "^",
        "characters": ['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ', ' '],
        "punctuations": "!(),-.;? ",
        "phonemes": ['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ'],
        "unique": True
    }
    symbols = make_symbols(**mohawk_characters)[0]
    # Longest symbols first so that digraphs such as ɡʷ match before ɡ.
    # re.escape keeps any symbol that is also a regex metacharacter (such as "^" or ".")
    # from being interpreted as regex syntax.
    pattern = "|".join(re.escape(s) for s in sorted(symbols, key=len, reverse=True))
    tokenizer = RegexpTokenizer(pattern)
    return tokenizer.tokenize(text)

0 reactions
stale[bot] commented, Mar 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.
