
Question: Multiple speakers' voices for training data

See original GitHub issue

TL;DR

Is it better to use voice data from a single speaker for training?

Question

I am a student studying speech synthesis at university in Japan.

I read the Tacotron paper and wanted to try it myself, so I tried to train Tacotron on Japanese. (I was able to confirm that training on English works as intended when using the training data available in the repository.)

The problem is that the encoder/decoder alignment for Japanese is not learned well.

Here is the attached alignment plot, step-23000-align.png.
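For reference, an alignment plot like this one can be produced with a few lines of matplotlib. The sketch below is illustrative only; the array shape, helper name, and file name are assumptions, not code from the repository.

```python
# Minimal sketch: visualize encoder/decoder attention weights.
# Assumes `alignment` is a NumPy array of shape (decoder_steps, encoder_steps).
import numpy as np
import matplotlib.pyplot as plt

def plot_alignment(alignment: np.ndarray, path: str) -> None:
    """Save a heatmap of the attention alignment; a healthy model
    shows a roughly diagonal (monotonic) band."""
    fig, ax = plt.subplots()
    im = ax.imshow(alignment.T, aspect="auto", origin="lower",
                   interpolation="none")
    fig.colorbar(im, ax=ax)
    ax.set_xlabel("Decoder timestep")
    ax.set_ylabel("Encoder timestep")
    fig.savefig(path)
    plt.close(fig)

# Example with random weights (a real run would pass the model's attention matrix):
plot_alignment(np.random.rand(200, 60), "step-23000-align.png")
```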

The training data is speech from three speakers, each reading 100 sentences.

I suppose the reasons why the alignment fails are as follows:

  • Too few training samples
  • The data mixes three different speakers (a multi-speaker dataset)

Thank you.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
r9y9 commented, Nov 8, 2017

If you are building multi-speaker models (like DeepVoice2 or 3), it should be okay to use data from multiple speakers. However, I’m pretty sure that the reason you got the non-monotonic alignment is that you don’t have sufficient data. https://sites.google.com/site/shinnosuketakamichi/publication/jsut is a freely available Japanese dataset that might be useful for you. It consists of 10 hours of audio recordings of a single female speaker. I just started to explore the dataset today with the DeepVoice3 architecture and can get nearly monotonic attention very quickly, as follows:

[Attached image: step000005000_layer_1_alignment]

Code is available at https://github.com/r9y9/deepvoice3_pytorch. If you are interested, feel free to contact me.
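For readers wondering what a "multi-speaker model" means concretely here: the usual technique (as in DeepVoice2/3) is to condition the network on a learned per-speaker embedding, so one model can train on several voices without averaging them together. The sketch below is a minimal, hypothetical PyTorch illustration of that idea; the class, names, and dimensions are assumptions, not deepvoice3_pytorch's actual code.

```python
# Minimal sketch of speaker-embedding conditioning (illustrative only).
import torch
import torch.nn as nn

class SpeakerConditionedEncoder(nn.Module):
    def __init__(self, vocab_size: int, num_speakers: int,
                 text_dim: int = 256, speaker_dim: int = 16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        # One learned vector per speaker lets a single model share
        # acoustic knowledge across all speakers in the training data.
        self.speaker_embed = nn.Embedding(num_speakers, speaker_dim)
        self.proj = nn.Linear(text_dim + speaker_dim, text_dim)

    def forward(self, text_ids: torch.Tensor,
                speaker_ids: torch.Tensor) -> torch.Tensor:
        # text_ids: (batch, time); speaker_ids: (batch,)
        t = self.text_embed(text_ids)                 # (batch, time, text_dim)
        s = self.speaker_embed(speaker_ids)           # (batch, speaker_dim)
        s = s.unsqueeze(1).expand(-1, t.size(1), -1)  # broadcast over time
        return torch.tanh(self.proj(torch.cat([t, s], dim=-1)))

# Usage with three speakers, as in the question:
enc = SpeakerConditionedEncoder(vocab_size=100, num_speakers=3)
out = enc(torch.randint(0, 100, (2, 7)), torch.tensor([0, 2]))
print(out.shape)  # torch.Size([2, 7, 256])
```

Without such conditioning, a single model trained on mixed voices has to explain three different sets of voice characteristics with one set of parameters, which makes attention even harder to learn on top of an already small dataset.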

0 reactions
rild commented, Nov 9, 2017

@r9y9

If you are interested, feel free to contact me.

I feel very encouraged to hear that. Thank you so much. I would like to send you an email later.


