
Generating good audio samples


Let’s discuss strategies for producing audio samples. When running over the entire dataset, I’ve so far only managed to reproduce recording noise and clicks.

Some ideas I’ve had to improve on this:

  • We should limit ourselves to a single speaker for now. That will allow us to perform multiple epochs on the train dataset. We could also try overfitting the dataset a little, which should result in the network reproducing pieces of the train dataset.
  • Remove silence from the recordings. Many of the recordings have periods of recording noise before and after the speaker. It might be worth removing these with librosa (a sketch follows below).
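
A minimal sketch of the silence-trimming idea, assuming librosa’s trim/split utilities; the filename and the top_db threshold are placeholders that would need tuning per corpus:

import librosa

# Load one VCTK recording at the 16 kHz rate used for training
# (the filename here is a placeholder).
audio, sr = librosa.load('p280_001.wav', sr=16000)

# Drop leading/trailing stretches quieter than top_db below the peak.
# top_db=20 is a guess; noisier recordings may need a different value.
trimmed, _ = librosa.effects.trim(audio, top_db=20)

# Alternatively, find all non-silent intervals within the clip.
intervals = librosa.effects.split(audio, top_db=20)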

Issue Analytics

  • State: open
  • Created: 7 years ago
  • Comments: 86 (53 by maintainers)

Top GitHub Comments

13 reactions
ghenter commented, Sep 23, 2016

Very cool work, guys! As a text-to-speech person, I am excited to see where this effort may lead.

As far as generating good-sounding output, I believe I have some thoughts to add regarding point 3 in @jyegerlehner’s list, on the use of floating point values vs. one-hot vectors for the network inputs. I hope this is the right issue in which to post them.

I met with Heiga Zen, one of the authors of the WaveNet paper, at a speech synthesis workshop last week. I quizzed him quite a bit on the paper when I had the chance. My understanding is that there are two key motivations for using (mu-law companded) one-hot vectors for the single-sample network output:

  1. This turns the problem from a regression task to a classification task. For some reason, DNNs have seen greater success in classification than in regression. (This has motivated the research into generative adversarial networks, which is another hot topic at the moment.) Up until now, most DNN-based waveform/audio generation approaches were formulated as regression problems.
  2. A softmax output layer allows a flexible representation of the distribution of possible output values, from which the next value is generated by sampling. Empirically, this worked better than parametrising the output distribution using GMMs (i.e., a mixture density network).
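
For concreteness, here is a minimal numpy sketch of the mu-law companding and 256-way quantisation described in the WaveNet paper; the function names and rounding details are my own:

import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    # Compand a waveform in [-1, 1]: sign(x) * ln(1 + mu|x|) / ln(1 + mu).
    mu = quantization_channels - 1
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Map [-1, 1] onto integer class labels {0, ..., mu}.
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(labels, quantization_channels=256):
    # Invert the quantisation and companding (up to quantisation noise).
    mu = quantization_channels - 1
    companded = 2 * (labels.astype(np.float32) / mu) - 1
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu

The softmax layer then predicts a distribution over these 256 classes, and generation samples a class and decodes it back to a waveform value.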

Note that both of these key concerns are only relevant at the output layer, not at the input layer. As far as the input representation goes, scalar floating-point values have several advantages over a one-hot vector discrete representation:

  • Scalar inputs have lower dimensionality, requiring fewer parameters in the network. (They are compact and dense instead of a factor 256 larger and sparse.)
  • Using floats does not introduce (additional) quantisation noise.
  • Applying convolutions to floating point values is interpretable as a filter, as @jyegerlehner said. The effect of applying convolutions to one-hot vectors, in contrast, is opaque.
  • Finally, and most importantly, the actual waveform sample values are numerical, so they have both a magnitude and an internal ordering. These properties matter hugely. Feeding in a categorical representation (one-hot vectors) would essentially force the network to learn the relative values and ordering associated with each input node, in order to make sense of the input. Since there are something like 256 values x 300 ms x 16 kHz = 1.2 million one-hot input nodes, this is a formidable learning task that is entirely avoided by using a floating point representation. (A quick size comparison follows below.)
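
To put a number on that last point, a quick back-of-the-envelope comparison of the two input representations, using the 300 ms / 16 kHz figures from the estimate above:

import numpy as np

n_samples = int(0.300 * 16000)   # 4800 samples in a 300 ms window
channels = 256

# Scalar representation: one float per sample.
scalar_input = np.zeros(n_samples, dtype=np.float32)
print(scalar_input.size)         # 4800 input values

# One-hot representation: 256 indicator values per sample.
labels = np.zeros(n_samples, dtype=np.int64)
one_hot_input = np.eye(channels, dtype=np.float32)[labels]
print(one_hot_input.size)        # 1228800 input values, i.e. ~1.2 million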

Seeing that WaveNet is based on PixelCNNs, it might be instructive to consider how the latter handle and encode their inputs. There appears to be a working implementation of PixelCNNs on GitHub, but I haven’t looked sufficiently deeply into it to tell how they encode their input.

7 reactions
jyegerlehner commented, Sep 26, 2016

Has everyone been reproducing ibab’s results? I got a result similar to his, but I think it sounds a bit smoother; I’m guessing that’s because the receptive field is a little bigger than his.

2 seconds: https://soundcloud.com/user-731806733/speaker-p280-from-vctk-corpus-1

10 seconds: https://soundcloud.com/user-731806733/speaker-280-from-vctk-corpus-2

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2],
    "residual_channels": 32,
    "dilation_channels":32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": true
}
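
As a side note, the receptive field this config implies can be estimated from the dilations; a rough sketch of the arithmetic (my own, and it ignores the extra samples contributed by the initial causal layer):

# Each dilated layer with filter width w and dilation d widens the
# receptive field by (w - 1) * d samples.
filter_width = 2
dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] * 2 + [1, 2]

receptive_field = (filter_width - 1) * sum(dilations) + 1
print(receptive_field)                   # 2050 samples
print(1000.0 * receptive_field / 16000)  # ~128 ms at 16 kHz

That is only about 128 ms of context, which is consistent with the remark below that longer-range structure such as pauses between words needs a wider receptive field.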

[Edit] After mortont’s comment below: I used learning_rate=0.001.

I made a copy of the corpus directory, except I only copied over the directory for speaker p280. I stopped training at about 28K steps, to follow ibab’s example. Loss was a bit lower than his, around 2.0-2.1.

I think to get pauses between words and such we need a wider receptive field. That’s my next step.

By the way, does anyone know how to make SoundCloud loop the playback instead of playing music at the end of the clip, like ibab did? Is a Pro account needed for that?

[Edit] Here’s one from a model that has about a 250 ms receptive field, trained for about 16 hours: https://soundcloud.com/user-731806733/generated-larger-1
