Inconsistent duration and frame count after using the mfcc function with n_fft and hop_length inputs
See original GitHub issue
I have a .wav audio file for which I want to compute the MFCCs using the n_fft and hop_length inputs. However, when I do so, I get inconsistent results for the duration and the number of frames, as shown in the code below:
y, sr = librosa.load(fn, sr=16000)
y.shape
OUT: (671744,)
duration = librosa.get_duration(y, sr=sr)
duration
OUT: 41.984
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=800, hop_length=400)
mfccs.shape
OUT: (13, 1680)
librosa.frames_to_time(mfccs.shape[1], sr=sr, n_fft=800, hop_length=400)
OUT: 42.024999999999999
print('frames=', librosa.time_to_frames(times=duration, sr=sr, n_fft=800, hop_length=400))
print('samples=', librosa.time_to_samples(times=duration, sr=sr))
OUT:
frames= 1678
samples= 671744
So as you can see, the duration is 42.025 when calculated from the frames, whereas calculated from the time series it is 41.984. Similarly, the number of frames in the mfccs array is 1680, whereas librosa.time_to_frames says 1678, which is also what I would expect.
Do you have any idea why this is happening? Many thanks!
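The reported numbers can be reproduced with plain integer arithmetic (a sketch, assuming librosa's default centered STFT framing; the formulas below are reconstructions for illustration, not librosa source):

```python
n_samples, sr, n_fft, hop_length = 671744, 16000, 800, 400

# Duration straight from the sample buffer:
print(n_samples / sr)                            # 41.984

# With centered frames (center=True), the signal is padded by n_fft // 2
# samples on each side, so every hop produces one frame:
print(1 + n_samples // hop_length)               # 1680 -- matches mfccs.shape[1]

# time_to_frames with an n_fft argument compensates for the left-side
# padding, which drops the trailing partially-filled frames:
print((n_samples - n_fft // 2) // hop_length)    # 1678
```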
Issue Analytics
- Created: 5 years ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
The issue is that frame-based methods use centered windows by default, which involves a bit of padding at either end of the signal. get_duration on the raw sample array is accurate up to the sample rate and does not account for any frame padding, so there will be a discrepancy there.
Relatedly, frames-to-time conversion is only accurate up to the frame rate of the signal, so any samples that don't quite fill a frame will not be accounted for. Or, if padding is involved, the computed duration will be slightly longer than the signal actually is. Either way, it is pinned to the frame rate, which is (much) lower than the sample rate.
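The padding effect can be made concrete with a small helper (a sketch; frame_count is a hypothetical name for illustration, not a librosa function):

```python
def frame_count(n_samples, n_fft, hop_length, center=True):
    """Number of STFT frames covering a signal of n_samples samples."""
    if center:
        # centered framing pads n_fft // 2 samples on each side first
        n_samples = n_samples + 2 * (n_fft // 2)
    return 1 + (n_samples - n_fft) // hop_length

print(frame_count(671744, 800, 400, center=True))   # 1680
print(frame_count(671744, 800, 400, center=False))  # 1678
```

With centering, the padded length minus n_fft collapses back to n_samples (for even n_fft), which is why the frame count lines up with every hop of the raw signal.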
Including the n_fft parameter in frames_to_time can compensate for a bit of padding on either side, but if you really want a sample-accurate duration, it is best to compute it from the audio buffer itself. I hope that clears things up.
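To see the frame-rate quantization concretely, here is the same comparison as plain arithmetic (a sketch, assuming the offset behaviour visible in the outputs quoted above):

```python
sr, hop_length, n_fft = 16000, 400, 800
n_frames = 1680

# frame count -> time with the n_fft offset (as in the question):
print((n_frames * hop_length + n_fft // 2) / sr)   # 42.025
# frame count -> time without the n_fft offset:
print((n_frames * hop_length) / sr)                # 42.0
# sample-accurate duration from the buffer itself:
print(671744 / sr)                                 # 41.984
```

None of the frame-derived values can land exactly on 41.984, because they only move in steps of hop_length / sr = 0.025 s.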
Yes, they're using different window lengths, but projecting down to the same number of MFCCs. The hop length is the same, so you get the same output shape. That's controlled by passing n_fft=800 to mfcc. You won't need to include that parameter when converting a frame count to time, though, for the reasons listed above.
I hope that helps. I get that this is a little confusing, and the documentation could be better here. If you have any suggestions for how to improve it, I'd be happy to incorporate changes.
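A quick check that, with centered frames, the frame count depends only on the hop length and not on n_fft (a sketch using the same padding arithmetic as above):

```python
n_samples, hop_length = 671744, 400

for n_fft in (400, 800, 1600, 2048):
    padded = n_samples + 2 * (n_fft // 2)        # center=True padding
    n_frames = 1 + (padded - n_fft) // hop_length
    print(n_fft, n_frames)                       # 1680 in every case
```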