Inconsistent output for duration and frames after using mfcc function with n_fft and hop_length inputs

I have a .wav audio file for which I want to calculate the MFCCs using the n_fft and hop_length inputs. However, when I do so, I get inconsistent results for the duration and the number of frames, as shown in the code below:

import librosa

# fn is the path to the .wav file; resample to 16 kHz on load
y, sr = librosa.load(fn, sr=16000)
y.shape
OUT: (671744,)
# sample-accurate duration, computed from the raw audio buffer
duration = librosa.get_duration(y=y, sr=sr)
duration
OUT: 41.984
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=800, hop_length=400)
mfccs.shape
OUT: (13, 1680)
# duration estimated from the number of MFCC frames
librosa.frames_to_time(mfccs.shape[1], sr=sr, n_fft=800, hop_length=400)
OUT: 42.024999999999999
print('frames=', librosa.time_to_frames(times=duration, sr=sr, n_fft=800, hop_length=400))
print('samples=', librosa.time_to_samples(times=duration, sr=sr))
OUT:
frames= 1678
samples= 671744

As you can see, the duration calculated from the frames is 42.025, whereas the duration calculated from the time series is 41.984. Similarly, the number of frames in the mfccs array is 1680, whereas librosa.time_to_frames says 1678, which is what I would expect as well. Do you have any idea why this is happening? Many thanks!

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
bmcfee commented, Mar 23, 2018

The issue is that frame-based methods use centered windows by default, which adds a bit of padding at either end of the signal. You can use get_duration on the raw sample array, which is accurate up to the sample rate and does not account for any frame padding, so there will be a discrepancy there.
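
The effect of centering on the frame count can be checked with plain arithmetic. The sketch below does not call librosa; `frame_count` is a hypothetical helper, and it assumes that centered framing pads the signal by `n_fft // 2` samples on each side. The numbers are the ones from the question.

```python
def frame_count(n_samples, n_fft, hop_length, center=True):
    """Count analysis frames for a signal of n_samples samples.

    Assumption: center=True pads n_fft // 2 samples on each side.
    """
    padded = n_samples + 2 * (n_fft // 2) if center else n_samples
    if padded < n_fft:
        return 0  # not even one full window fits
    return 1 + (padded - n_fft) // hop_length


n_samples, n_fft, hop_length = 671744, 800, 400

centered = frame_count(n_samples, n_fft, hop_length, center=True)
uncentered = frame_count(n_samples, n_fft, hop_length, center=False)

print(centered)    # 1680, the mfccs.shape[1] reported in the question
print(uncentered)  # 1678
```

Under that assumption, the two extra frames come entirely from the padding, not from the audio itself.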

Relatedly, frames->time conversion is only accurate up to the frame rate of the signal, so any samples that don’t quite fill a frame will not be accounted for. Or, if there’s padding involved, the duration will be slightly longer than the signal actually is. Either way, it’s going to be pinned to the frame rate, which is (much) lower than the sample rate.
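
As a rough illustration of that quantization, here is the same reasoning in plain Python (no librosa calls), using the sample count and parameters from the question:

```python
sr = 16000          # sample rate from the question (Hz)
hop_length = 400    # hop between frames (samples)
n_samples = 671744  # len(y) from the question

frame_period = hop_length / sr  # seconds covered by one hop
full_hops, leftover = divmod(n_samples, hop_length)

print(frame_period)         # 0.025
print(full_hops, leftover)  # 1679 144
print(leftover / sr)        # 0.009 s that no whole hop accounts for
print(n_samples / sr)       # 41.984, the sample-accurate duration
```

Any duration derived from a frame count can only move in steps of 0.025 s here, while the audio buffer resolves 1/16000 s.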

Including the n_fft parameter in frames_to_time can compensate for a bit of padding on either side, but if you really want a sample-accurate duration, it’s best to go from the audio buffer itself.
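
The numbers in the question can be reproduced with a small sketch. The two helpers below are hypothetical stand-ins written in plain Python, not librosa's own code. They assume the conversion shifts by `n_fft // 2` samples when `n_fft` is given, which is consistent with the outputs reported in the question.

```python
def frames_to_seconds(frames, sr, hop_length, n_fft=None):
    """Frame index -> seconds, with an assumed n_fft // 2 sample offset."""
    offset = n_fft // 2 if n_fft else 0
    return (frames * hop_length + offset) / sr


def seconds_to_frames(seconds, sr, hop_length, n_fft=None):
    """Seconds -> frame index, undoing the same assumed offset."""
    offset = n_fft // 2 if n_fft else 0
    samples = round(seconds * sr)  # round to avoid float truncation
    return (samples - offset) // hop_length


sr, hop_length, n_fft = 16000, 400, 800

print(frames_to_seconds(1680, sr, hop_length, n_fft=n_fft))    # 42.025
print(frames_to_seconds(1680, sr, hop_length))                 # 42.0
print(seconds_to_frames(41.984, sr, hop_length, n_fft=n_fft))  # 1678
print(seconds_to_frames(41.984, sr, hop_length))               # 1679
```

None of these frame-based values equals 41.984 s exactly, which is why the sample-accurate duration is better taken as `len(y) / sr` from the audio buffer.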

I hope that clears things up.

1 reaction
bmcfee commented, Mar 23, 2018

"These two lines return the same shape but have different values:"

Yes, they’re using different window lengths, but projecting down to the same number of mfccs. The hop length is the same, so you get the same shape out.
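
A quick arithmetic check of why the shape does not change, assuming centered framing with `n_fft // 2` samples of padding per side and an even window length. This is plain Python; `centered_frames` is a hypothetical helper, and the window lengths other than 800 are illustrative, not taken from the thread.

```python
def centered_frames(n_samples, n_fft, hop_length):
    """Frame count under the assumed centered padding of n_fft // 2 per side."""
    padded = n_samples + 2 * (n_fft // 2)
    return 1 + (padded - n_fft) // hop_length


n_samples, hop_length = 671744, 400

# Same hop length, different window lengths
counts = {n_fft: centered_frames(n_samples, n_fft, hop_length)
          for n_fft in (400, 800, 2048)}

print(counts)  # {400: 1680, 800: 1680, 2048: 1680}
```

With even window lengths the padding cancels the window length, so the count reduces to `1 + n_samples // hop_length` and only the hop length matters.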

"I would like to guarantee that there are 800 samples in the window when calculating the mfcc"

That’s done by saying n_fft=800 in mfcc. You won’t need to include that parameter when converting frame count to time though, for reasons listed above.

I hope that helps – I get that this is a little confusing, and the documentation could be better for this. If you have any suggestions for how to improve, I’d be happy to incorporate changes.
