Output audio duration does not exactly match input audio.
Running through your pre-trained models, I found that the generated audio does not exactly match the input in duration. For example:
wav, sr = load_wav(os.path.join(a.input_wavs_dir, filename))
wav = wav / MAX_WAV_VALUE
wav = torch.FloatTensor(wav).to(device)  # wav shape is torch.Size([71334])
x = get_mel(wav.unsqueeze(0))            # x shape is torch.Size([1, 80, 278])
y_g_hat = generator(x)                   # y_g_hat shape is torch.Size([1, 1, 71168])
As you can see, there is a mismatch: 71334 input samples versus 71168 output samples. What is happening, and why is this the case? Is there a way to change it so that the input and output shapes match?
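A minimal sketch of the length arithmetic behind these numbers, assuming the hop_size of 256 mentioned later in the thread (HiFi-GAN's mel front end pads by (n_fft - hop_size) // 2 on each side and uses a non-centered STFT, so the frame count works out to floor(n_input / hop_size)):

```python
hop_size = 256   # hop_size from the config assumed in this thread
n_input = 71334  # input sample count from the snippet above

# Number of mel frames the front end produces for this input.
n_frames = n_input // hop_size   # 278, matching the [1, 80, 278] shape

# The generator upsamples each frame back to hop_size samples.
n_output = n_frames * hop_size   # 71168, matching the [1, 1, 71168] shape

print(n_frames, n_output)  # 278 71168
```

The 166 leftover samples (71334 % 256) never make it into a full frame, which is exactly the 71334 - 71168 discrepancy in the question.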
Thank you.
Edit: While checking training, I found that if the target segment_size is a multiple of 256 (hop_size), then y_g_hat = generator(x) has exactly the same number of samples as the input.
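Following that observation, one way to guarantee matching shapes is to trim the input to a multiple of hop_size before computing the mel spectrogram. A small sketch (the helper name is hypothetical, and hop_size = 256 is the value assumed in this thread):

```python
def trim_to_hop_multiple(wav, hop_size=256):
    """Drop trailing samples so that len(wav) % hop_size == 0.

    After trimming, the generator's output length equals the input length,
    since each mel frame is upsampled back to exactly hop_size samples.
    """
    return wav[: len(wav) - len(wav) % hop_size]
```

For the 71334-sample input above this drops the trailing 166 samples, leaving 71168, which matches the generator output exactly.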
Issue Analytics
- Created: 3 years ago
- Comments: 9 (2 by maintainers)

This mismatch is caused by the padding and the transposed convolutions. You should choose segment_size such that segment_size % hop_size == 0: after the mel front end pads the input by (n_fft - hop_size) samples, an input of segment_size samples yields segment_size / hop_size mel frames. In other words, one frame represents hop_size sampling points.

Thank you. It is correct to adjust the chunk size so that it is divisible by hop_size. I don't know what sample rate you're using, but a 10 ms chunk seems too short to generate high-quality audio considering the receptive field of the generator.
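If trimming is not acceptable (e.g. you must preserve every input sample), the complementary approach is to right-pad the input with zeros up to the next multiple of hop_size, so the generator output is at least as long as the original and can be sliced back. A sketch under the same hop_size = 256 assumption; the helper name is hypothetical:

```python
import numpy as np

def pad_to_hop_multiple(wav, hop_size=256):
    """Right-pad with zeros so that len(wav) is a multiple of hop_size."""
    remainder = len(wav) % hop_size
    if remainder:
        wav = np.pad(wav, (0, hop_size - remainder))
    return wav
```

For the 71334-sample example this pads by 90 samples to 71424 (= 279 * 256); after synthesis you would slice the output back to the original 71334 samples.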