A possible approach to pronunciation customization
Hi, I'm going to re-raise the topic from #12, which is currently closed. I apologize, and I appreciate that this is in some sense bad form.
I also would like the ability to, occasionally, finely control pronunciation, and I believe that fundamentally it's not a machine-solvable problem, thanks to the literal nightmare that is last names. I know six people who have the same last name by codepoint, but none of them say it the same way, and there's nothing your software could ever do to cope with that, because it's unavailable contextual knowledge.
The problem is, if you want to do high-quality rendering, getting names right is a sign of respect, so this genuinely matters, and I believe it needs to be, in some way, droppable to user control.
And so I was going to go bug the ocotillo author. Hm. Guess that works out nicely.
I don’t entirely understand where the English <-> Audio mapping comes from, but on a quick glance, it looks like it might be in jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli.
And so I was wondering.
- How hard would it be to have two of these?
- If the underlying symbolic language was in some way deterministic with regard to end pronunciation - that is, if it's somehow a least-worst case - how hard would it be to adapt the jbetker model to a second syllabary?
The reason being, y’know, the International Phonetic Alphabet is in Unicode, and does a pretty reasonable job with most real world languages. And that would reduce the job to Googling someone’s name once, putting it in a lookup table in IPA, and promptly forgetting about it for eternity.
Which, to me, sounds pretty good.
Or, if you prefer, ask Siobhan and Pádraig Moloughney from Worcester, Massachusetts ("shavon and petrick molockney from wooster mass").
`Let's talk to [ipa:ʃəˈvɔːn] and [ipa:ˈpˠɑːɾˠɪɟː mʌːlɒkːniː] about it`

is nicely unambiguous, and fits with the symbology in the other request.
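To make the lookup-table idea concrete, here is a minimal Python sketch of a preprocessing pass that swaps known names for [ipa:...] tags before the text ever reaches the model. The dictionary entries and the tag syntax are just the illustration above, not anything Tortoise supports today.

```python
# Rough sketch: a per-user lookup table that rewrites known names into
# [ipa:...] tags. The names and the tag syntax are illustrative only,
# not an existing Tortoise API.
import re

PRONUNCIATIONS = {
    "Siobhan": "[ipa:ʃəˈvɔːn]",
    "Pádraig Moloughney": "[ipa:ˈpˠɑːɾˠɪɟ mʌlɒkniː]",
}

def apply_pronunciations(text: str) -> str:
    # Longest names first, so "Pádraig Moloughney" wins over "Pádraig".
    for name in sorted(PRONUNCIATIONS, key=len, reverse=True):
        text = re.sub(re.escape(name), PRONUNCIATIONS[name], text)
    return text

print(apply_pronunciations("Let's talk to Siobhan and Pádraig Moloughney about it"))
# -> Let's talk to [ipa:ʃəˈvɔːn] and [ipa:ˈpˠɑːɾˠɪɟ mʌlɒkniː] about it
```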
Top GitHub Comments
I have been thinking about this over the last two days. In retrospect, I think it would have been absolutely possible to have trained Tortoise to speak both conventional alphabet and phonetic alphabet. There are plenty of datasets out there that use the phonetic alphabet that I could have inserted into training (or I could have trained a wav2vec2 model to transcribe into phonetic AND conventional and then picked one version at random while training Tortoise). So I guess the answer to the question/suggestion here is “yes - I am pretty sure that this is possible”.
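As a rough sketch of the "pick one version at random" idea, a training dataset could simply choose between the conventional and IPA transcript per sample; the field names and the 50/50 split below are assumptions, not the actual Tortoise training code.

```python
# Hedged sketch of mixing conventional and phonetic transcripts at
# training time. Each clip carries both transcriptions; one is chosen
# at random per fetch so the model learns to read either alphabet.
import random
from torch.utils.data import Dataset

class MixedTranscriptDataset(Dataset):
    def __init__(self, clips, ipa_probability=0.5):
        # clips: list of dicts with 'audio', 'text', and 'ipa' entries.
        self.clips = clips
        self.ipa_probability = ipa_probability

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        clip = self.clips[idx]
        use_ipa = random.random() < self.ipa_probability
        transcript = clip["ipa"] if use_ipa else clip["text"]
        return clip["audio"], transcript
```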
As it stands, though, if I wanted to train Tortoise to be able to speak the phonetic alphabet, I’d need to change its symbolic lexicon. I’m a bit nervous that this will involve re-training the autoregressive transformer.
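For illustration, "changing the symbolic lexicon" might amount to something like appending IPA characters to a character-level symbol table; the symbol lists and function below are placeholders, not the real Tortoise tokenizer.

```python
# Illustrative only: extending a character-level symbol table with IPA.
# The base symbol list and the IPA subset here are placeholders.
BASE_SYMBOLS = list("_-!'(),.:;? abcdefghijklmnopqrstuvwxyz")
IPA_SYMBOLS = list("ʃəˈɔːɪɛæɑʊθðŋɾɟʲˠ")  # a small, arbitrary subset

SYMBOLS = BASE_SYMBOLS + [s for s in IPA_SYMBOLS if s not in BASE_SYMBOLS]
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}

def text_to_ids(text: str):
    # Unknown characters are dropped; the new IPA symbols map to newly
    # appended ids, which is why the autoregressive model would need at
    # least some retraining to learn embeddings for them.
    return [SYMBOL_TO_ID[c] for c in text.lower() if c in SYMBOL_TO_ID]
```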
I’m willing to try making this fix, because I agree that this would be a major feature addition, but I cannot currently commit to it. My priority right now is implementing a feature to support the suggestion from #16 because I think the finding there is super cool and it won’t tie up my GPUs, which are currently working on something else. 😃
Let's keep this open, and I will try to get around to it.
I’ve opened up the wandb for this model if anyone is curious to follow along. This project contains all of my training attempts for the autoregressive model. You’ll want to watch the latest runs, titled `unified_large_with_phonetic`. https://wandb.ai/neonbjb/train_gpt_tts