Question: Multiple speakers voice for training data
TL;DR
Is it better to use voice data from a single speaker for training?
Question
I am a student studying speech synthesis at university in Japan.
I read the Tacotron paper and wanted to try it myself, so I am trying to train Tacotron on Japanese. (I was able to confirm that training on English works as intended when using the training data available in the repository.)
The problem is that the encoder/decoder alignment for Japanese is not learned well.
Here is step-23000-align.png
For training, I used speech data from three speakers, each reading 100 sentences.
I suppose the reasons why the alignment fails are as follows:
- Too few training samples
- The training data mixes 3 different speakers (multi-speaker data)
Thank you.
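(For reproducibility: alignment images such as step-23000-align.png are typically rendered from the decoder's attention weights. Below is a minimal plotting sketch, assuming the training script has dumped the attention matrix as a NumPy array of shape (decoder_steps, encoder_steps); the .npy file name is hypothetical and not from the original repository.)

```python
# Sketch: visualize an encoder/decoder attention matrix to check whether the
# alignment is roughly monotonic (a clean diagonal). Assumes the attention
# weights were saved as a .npy array of shape (decoder_steps, encoder_steps);
# the file name below is hypothetical.
import numpy as np
import matplotlib.pyplot as plt

alignment = np.load("step-23000-align.npy")  # hypothetical dump of attention weights

fig, ax = plt.subplots(figsize=(8, 4))
im = ax.imshow(alignment.T, aspect="auto", origin="lower", interpolation="none")
ax.set_xlabel("Decoder timestep")
ax.set_ylabel("Encoder timestep")
ax.set_title("Encoder/decoder attention (monotonic alignment ~ clean diagonal)")
fig.colorbar(im, ax=ax)
fig.savefig("step-23000-align.png", dpi=150)
```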
Issue Analytics
- Created 6 years ago
- Comments: 6 (1 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
If you are building multi-speaker models (like DeepVoice2 or 3), it should be okay to use data from multiple speakers. However, I'm pretty sure that the reason you got the non-monotonic alignment is that you don't have sufficient data. https://sites.google.com/site/shinnosuketakamichi/publication/jsut is a freely available Japanese dataset that might be useful for you. The dataset consists of 10 hours of audio recordings of a single female speaker. I just started to explore the dataset today with the DeepVoice3 architecture and can get nearly monotonic attention very quickly (see the attached alignment plot).
Code is available at https://github.com/r9y9/deepvoice3_pytorch. If you are interested, feel free to contact me.
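(Background on the multi-speaker point above: models such as DeepVoice2/3 condition the network on a learned speaker embedding, which is what lets one model absorb data from several speakers without averaging their voices. The sketch below illustrates that general mechanism in PyTorch; it is not code from deepvoice3_pytorch, and all class, parameter, and tensor names are illustrative.)

```python
# Sketch: speaker-embedding conditioning, the generic mechanism used by
# multi-speaker TTS models such as DeepVoice2/3. This is an illustration,
# not the actual deepvoice3_pytorch implementation; names are hypothetical.
import torch
import torch.nn as nn

class SpeakerConditionedEncoder(nn.Module):
    def __init__(self, vocab_size=256, text_dim=128, n_speakers=3, speaker_dim=16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        self.speaker_embed = nn.Embedding(n_speakers, speaker_dim)
        # Project the concatenated (text, speaker) features back to text_dim
        # so the rest of the encoder/decoder stack is unchanged.
        self.proj = nn.Linear(text_dim + speaker_dim, text_dim)

    def forward(self, text_ids, speaker_ids):
        # text_ids: (batch, max_text_len), speaker_ids: (batch,)
        t = self.text_embed(text_ids)                  # (B, T, text_dim)
        s = self.speaker_embed(speaker_ids)            # (B, speaker_dim)
        s = s.unsqueeze(1).expand(-1, t.size(1), -1)   # broadcast over time
        return torch.tanh(self.proj(torch.cat([t, s], dim=-1)))

# Usage: three speakers, as in the question.
enc = SpeakerConditionedEncoder(n_speakers=3)
text = torch.randint(0, 256, (2, 50))   # dummy character IDs
speakers = torch.tensor([0, 2])         # per-utterance speaker labels
out = enc(text, speakers)               # (2, 50, 128)
```

Without such a per-utterance speaker label, a single-speaker model trained on three voices has to average over them, which is one plausible contributor to the unstable alignment described in the question.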
@r9y9
I feel very encouraged to hear that. Thank you so much. I would like to send you an email later.