Output audio duration does not exactly match input audio.
Running through your pre-trained models, I found that the generated audio does not exactly match the input in duration. For example:
wav, sr = load_wav(os.path.join(a.input_wavs_dir, filename))
wav = wav / MAX_WAV_VALUE
wav = torch.FloatTensor(wav).to(device)  # wav shape is torch.Size([71334])
x = get_mel(wav.unsqueeze(0))            # x shape is torch.Size([1, 80, 278])
y_g_hat = generator(x)                   # y_g_hat shape is torch.Size([1, 1, 71168])
As you can see, there is a mismatch: 71334 input samples versus 71168 output samples. What is happening, and why is this the case? Is there a way to change it so that the input and output shapes match?
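A minimal sketch of the length arithmetic behind these numbers, assuming the hop_size of 256 mentioned later in the thread (HiFi-GAN's mel front end pads by (n_fft - hop_size) // 2 on each side and uses a non-centered STFT, so the frame count works out to floor(n_input / hop_size)):

```python
hop_size = 256   # hop_size from the config assumed in this thread
n_input = 71334  # input sample count from the snippet above

# Number of mel frames the front end produces for this input.
n_frames = n_input // hop_size   # 278, matching the [1, 80, 278] shape

# The generator upsamples each frame back to hop_size samples.
n_output = n_frames * hop_size   # 71168, matching the [1, 1, 71168] shape

print(n_frames, n_output)  # 278 71168
```

The 166 leftover samples (71334 % 256) never make it into a full frame, which is exactly the 71334 - 71168 discrepancy in the question.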
Thank you.
Edit: While checking training, I found that if the target segment_size is a multiple of 256 (hop_size), then y_g_hat = generator(x) has exactly the same number of samples as the input.
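Following that observation, one way to guarantee matching shapes is to trim the input to a multiple of hop_size before computing the mel spectrogram. A small sketch (the helper name is hypothetical, and hop_size = 256 is the value assumed in this thread):

```python
def trim_to_hop_multiple(wav, hop_size=256):
    """Drop trailing samples so that len(wav) % hop_size == 0.

    After trimming, the generator's output length equals the input length,
    since each mel frame is upsampled back to exactly hop_size samples.
    """
    return wav[: len(wav) - len(wav) % hop_size]
```

For the 71334-sample input above this drops the trailing 166 samples, leaving 71168, which matches the generator output exactly.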
Issue Analytics
- Created: 3 years ago
- Comments: 9 (2 by maintainers)

This mismatch is caused by the padding and the transposed convolutions. You should choose segment_size such that segment_size % hop_size == 0: after the mel front end pads the input by (n_fft - hop_size) samples, an input of segment_size samples yields segment_size / hop_size mel frames. In other words, one frame represents hop_size sampling points.

Thank you. It is correct to adjust the chunk size so that it is divisible by hop_size. I don't know what sample rate you're using, but a 10 ms chunk seems too short to generate high-quality audio considering the receptive field of the generator.
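If trimming is not acceptable (e.g. you must preserve every input sample), the complementary approach is to right-pad the input with zeros up to the next multiple of hop_size, so the generator output is at least as long as the original and can be sliced back. A sketch under the same hop_size = 256 assumption; the helper name is hypothetical:

```python
import numpy as np

def pad_to_hop_multiple(wav, hop_size=256):
    """Right-pad with zeros so that len(wav) is a multiple of hop_size."""
    remainder = len(wav) % hop_size
    if remainder:
        wav = np.pad(wav, (0, hop_size - remainder))
    return wav
```

For the 71334-sample example this pads by 90 samples to 71424 (= 279 * 256); after synthesis you would slice the output back to the original 71334 samples.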