Generating good audio samples
See original GitHub issue

Let’s discuss strategies for producing audio samples. When running over the entire dataset, I’ve so far only managed to reproduce recording noise and clicks.
Some ideas I’ve had to improve on this:
- We should limit ourselves to a single speaker for now. That will let us run multiple epochs over the training set. We could also try overfitting a little, which should result in the network reproducing pieces of the training data.
- Remove silence from the recordings. Many of the recordings have stretches of recording noise before and after the speech. It might be worth trimming these with librosa (see the sketch after this list).
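A minimal preprocessing sketch for the silence-removal idea, assuming the standard VCTK directory layout and using librosa’s trim helper (the paths and the top_db threshold are my assumptions; tune top_db per corpus):

```python
# Trim leading/trailing recording noise from each clip before training.
import glob
import librosa
import soundfile as sf

TOP_DB = 20  # anything quieter than (peak - 20 dB) is treated as silence

for path in glob.glob("VCTK-Corpus/wav48/p280/*.wav"):
    y, sr = librosa.load(path, sr=None)                 # keep original rate
    y_trimmed, _ = librosa.effects.trim(y, top_db=TOP_DB)
    sf.write(path.replace(".wav", "_trim.wav"), y_trimmed, sr)
```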
Top GitHub Comments
Very cool work, guys! As a text-to-speech person, I am excited to see where this effort may lead.
As for generating good-sounding output, I believe I have some thoughts to add regarding point 3 in @jyegerlehner’s list, on the use of floating-point values vs. one-hot vectors for the network inputs. I hope this is the right issue in which to post them.
I met with Heiga Zen, one of the authors of the WaveNet paper, at a speech synthesis workshop last week. I quizzed him quite a bit on the paper when I had the chance. My understanding is that there are two key motivations for using (mu-law companded) one-hot vectors for the single-sample network output:
Note that both of these concerns are relevant only at the output layer, not at the input layer. As far as the input representation goes, scalar floating-point values have several advantages over a one-hot discrete representation:
Seeing that WaveNet is based on PixelCNNs, it might be instructive to consider how the latter handle and encode their inputs. There appears to be a working PixelCNN implementation on GitHub, but I haven’t looked deeply enough into it to tell how it encodes its input.
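For reference, here is a minimal numpy sketch of the mu-law companding and 256-way quantization that produces those one-hot output targets (the constants follow the WaveNet paper; the helper names are mine):

```python
import numpy as np

MU = 255  # number of quantization channels minus one

def mu_law_encode(x, mu=MU):
    """Map float audio in [-1, 1] to integer bins in [0, mu]."""
    x = np.clip(x, -1.0, 1.0)
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(bins, mu=MU):
    """Invert the companding: integer bins back to float audio."""
    companded = 2 * (bins.astype(np.float64) / mu) - 1
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu

# One-hot targets for the categorical (softmax) output layer:
bins = mu_law_encode(np.array([-0.5, 0.0, 0.25]))
one_hot = np.eye(MU + 1)[bins]  # shape (3, 256)
```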
Has everyone been reproducing ibab’s results? I got a result similar to his, but I think it sounds a bit smoother; I’m guessing that’s because the receptive field is a little bigger than his.
2 seconds: https://soundcloud.com/user-731806733/speaker-p280-from-vctk-corpus-1
10 seconds: https://soundcloud.com/user-731806733/speaker-280-from-vctk-corpus-2
[Edit] After mortont’s comment below: I used learning_rate=0.001.
I made a copy of the corpus directory, except I only copied over the directory for speaker p280. I stopped training at about 28K steps, to follow ibab’s example. Loss was a bit lower than his, around 2.0-2.1.
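A sketch of that single-speaker subset step, assuming the usual VCTK layout (the paths are illustrative):

```python
# Build a corpus copy containing only speaker p280's recordings.
import shutil
from pathlib import Path

src = Path("VCTK-Corpus/wav48/p280")
dst = Path("VCTK-p280/wav48/p280")
dst.parent.mkdir(parents=True, exist_ok=True)
shutil.copytree(src, dst)
```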
I think that to get pauses between words and such, we need a wider receptive field. That’s my next step (rough sizing sketch below).
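To put a number on “wider”, here is a rough sizing sketch for a stack of dilated causal convolutions; the dilation schedule and the 16 kHz sample rate are my assumptions, not the exact wavenet_params used above:

```python
# Receptive field of a stack of dilated causal convolutions.
FILTER_WIDTH = 2
DILATIONS = [2 ** i for i in range(10)] * 4   # four blocks of 1, 2, ..., 512

receptive_field = (FILTER_WIDTH - 1) * sum(DILATIONS) + FILTER_WIDTH
SAMPLE_RATE = 16000
print(receptive_field, "samples =",
      1000.0 * receptive_field / SAMPLE_RATE, "ms")
# -> 4094 samples, i.e. about 256 ms at 16 kHz, which is roughly the
#    ~250 ms receptive field mentioned in the edit below.
```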
By the way, does anyone know how to make SoundCloud loop the playback instead of playing music at the end of the clip, like ibab did? Is a Pro account needed for that?
[Edit] Here’s one from a model that has about a 250 ms receptive field, trained for about 16 hours: https://soundcloud.com/user-731806733/generated-larger-1