Inconsistent duration and frame count after using the mfcc function with n_fft and hop_length inputs
See original GitHub issue
I have a .wav audio file for which I want to compute the MFCCs using the n_fft and hop_length inputs. However, when I do so, I get inconsistent results for the duration and the number of frames, as shown in the code below:
y, sr = librosa.load(fn, sr=16000)
y.shape
OUT: (671744,)
duration = librosa.get_duration(y, sr=sr)
duration
OUT: 41.984
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=800, hop_length=400)
mfccs.shape
OUT: (13, 1680)
librosa.frames_to_time(mfccs.shape[1], sr=sr, n_fft=800, hop_length=400)
OUT: 42.024999999999999
print('frames=', librosa.time_to_frames(times=duration, sr=sr, n_fft=800, hop_length=400))
print('samples=', librosa.time_to_samples(times=duration, sr=sr))
OUT:
frames= 1678
samples= 671744
So as you can see, the duration is 42.025 when calculated from the frames, whereas calculated from the time series it is 41.984. Similarly, the number of frames in the mfccs array is 1680, whereas librosa.time_to_frames says 1678, which is also what I would expect.
Do you have any idea why this is happening? Many thanks!
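The reported numbers can be reproduced with plain integer arithmetic (a sketch, assuming librosa's default centered STFT framing; the formulas below are reconstructions for illustration, not librosa source):

```python
n_samples, sr, n_fft, hop_length = 671744, 16000, 800, 400

# Duration straight from the sample buffer:
print(n_samples / sr)                            # 41.984

# With centered frames (center=True), the signal is padded by n_fft // 2
# samples on each side, so every hop produces one frame:
print(1 + n_samples // hop_length)               # 1680 -- matches mfccs.shape[1]

# time_to_frames with an n_fft argument compensates for the left-side
# padding, which drops the trailing partially-filled frames:
print((n_samples - n_fft // 2) // hop_length)    # 1678
```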
Issue Analytics
- Created: 5 years ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
The issue is that frame-based methods use centered windows by default, which involves a bit of padding at either end of the signal. get_duration on the raw sample array is accurate up to the sample rate and does not account for any frame padding, so there will be a discrepancy there.
Relatedly, frames-to-time conversion is only accurate up to the frame rate of the signal, so any samples that don't quite fill a frame will not be accounted for. Or, if padding is involved, the computed duration will be slightly longer than the signal actually is. Either way, it is pinned to the frame rate, which is (much) lower than the sample rate.
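The padding effect can be made concrete with a small helper (a sketch; frame_count is a hypothetical name for illustration, not a librosa function):

```python
def frame_count(n_samples, n_fft, hop_length, center=True):
    """Number of STFT frames covering a signal of n_samples samples."""
    if center:
        # centered framing pads n_fft // 2 samples on each side first
        n_samples = n_samples + 2 * (n_fft // 2)
    return 1 + (n_samples - n_fft) // hop_length

print(frame_count(671744, 800, 400, center=True))   # 1680
print(frame_count(671744, 800, 400, center=False))  # 1678
```

With centering, the padded length minus n_fft collapses back to n_samples (for even n_fft), which is why the frame count lines up with every hop of the raw signal.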
Including the n_fft parameter in frames_to_time can compensate for a bit of padding on either side, but if you really want a sample-accurate duration, it is best to compute it from the audio buffer itself. I hope that clears things up.
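To see the frame-rate quantization concretely, here is the same comparison as plain arithmetic (a sketch, assuming the offset behaviour visible in the outputs quoted above):

```python
sr, hop_length, n_fft = 16000, 400, 800
n_frames = 1680

# frame count -> time with the n_fft offset (as in the question):
print((n_frames * hop_length + n_fft // 2) / sr)   # 42.025
# frame count -> time without the n_fft offset:
print((n_frames * hop_length) / sr)                # 42.0
# sample-accurate duration from the buffer itself:
print(671744 / sr)                                 # 41.984
```

None of the frame-derived values can land exactly on 41.984, because they only move in steps of hop_length / sr = 0.025 s.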
Yes, they're using different window lengths, but projecting down to the same number of MFCCs. The hop length is the same, so you get the same output shape. That's controlled by passing n_fft=800 to mfcc. You won't need to include that parameter when converting a frame count to time, though, for the reasons listed above.
I hope that helps. I get that this is a little confusing, and the documentation could be better here. If you have any suggestions for how to improve it, I'd be happy to incorporate changes.
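A quick check that, with centered frames, the frame count depends only on the hop length and not on n_fft (a sketch using the same padding arithmetic as above):

```python
n_samples, hop_length = 671744, 400

for n_fft in (400, 800, 1600, 2048):
    padded = n_samples + 2 * (n_fft // 2)        # center=True padding
    n_frames = 1 + (padded - n_fft) // hop_length
    print(n_fft, n_frames)                       # 1680 in every case
```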