Usage of audio_slice_frames, sample_frames, pad
Hello,
I saw that you used pad, audio_slice_frames, and sample_frames, but I can’t understand how those params are used. Can you explain what they mean?
Also, the original WaveRNN model used padded mel input in the first GRU layer. However, you slice out the padding after the first layer. Is it important to use padded mel in the first GRU?
Thanks.
Issue Analytics
- State:
- Created 4 years ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
Hi @macarbonneau,
No problem, I’m glad you found the repo useful. I haven’t tried using the end (or beginning) segments but there’s no real reason it shouldn’t work. The thinking behind using the middle segment was to match the training and inference conditions as much as possible. At inference time most of the input to the autoregressive part of the model (rnn2) will have context from the future and the past. So taking the middle segment is “closer” to what the network will see during inference. If you used the end segment, for example, the autoregressive component wouldn’t have future context at training time and the mismatch might cause problems during generation.
Hope that explains my thinking. If anything is unclear let me know.
One of the negative side effects of only using the middle segment is that there are sometimes small artifacts at the beginning or end of the generated audio. For the best quality it might be worth putting in some extra time to train on the entire segment.
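The middle-segment idea described above can be sketched as plain index arithmetic. This is only an illustration, not the repo's actual code: the function name sample_training_window is hypothetical, and the relationship pad = (sample_frames - audio_slice_frames) // 2 is an assumption drawn from the discussion (sample_frames mel frames go into the conditioning network, and only the audio for the middle audio_slice_frames frames is used for the autoregressive part, leaving pad frames of context on each side). The numbers in the usage lines are made up.

```python
import random


def sample_training_window(num_frames, sample_frames, audio_slice_frames,
                           hop_length, rng=random):
    """Pick a random mel window and the audio slice aligned with its middle.

    Hypothetical helper for illustration only.
    Returns (mel_start, mel_end, audio_start, audio_end); mel indices are in
    frames, audio indices are in samples.
    """
    if audio_slice_frames > sample_frames:
        raise ValueError("audio_slice_frames must not exceed sample_frames")
    if num_frames < sample_frames:
        raise ValueError("utterance is shorter than sample_frames")

    # Assumption: the extra context is split evenly before and after the slice.
    pad = (sample_frames - audio_slice_frames) // 2

    # Random window of sample_frames mel frames (randint is inclusive).
    mel_start = rng.randint(0, num_frames - sample_frames)
    mel_end = mel_start + sample_frames

    # Audio target: only the frames in the middle of that window.
    audio_start = (mel_start + pad) * hop_length
    audio_end = (mel_start + pad + audio_slice_frames) * hop_length
    return mel_start, mel_end, audio_start, audio_end


# Illustrative values only: 40 mel frames of input, 8 frames of audio target.
window = sample_training_window(num_frames=100, sample_frames=40,
                                audio_slice_frames=8, hop_length=200,
                                rng=random.Random(0))
mel_start, mel_end, audio_start, audio_end = window
```

With these values each target slice has 16 frames of mel context before and after it, which is the "past and future context" condition the middle segment is meant to reproduce. Training on the end segment instead would correspond to moving all of the context before the slice.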
Hello @bshall! Thank you for the awesome repo. Your code is very clean, I’m impressed. I’m playing a bit with your implementation and I have a question. Why do you take the middle of the mel segment? Why not just the end? Is there a benefit to having the padding at the end?
Thank you!!