Why are features at STFT based upon n_fft and not fft_size?
Describe the bug
https://librosa.org/librosa/generated/librosa.core.stft.html
“n_fft : int > 0 [scalar]. Length of the windowed signal after padding with zeros. The number of rows in the STFT matrix D is (1 + n_fft/2). The default value, n_fft=2048 samples, corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the default sample rate in librosa.”
How can n_fft=2048 samples correspond to a physical duration of 93 milliseconds at a sample rate of 22050 Hz, unless you assume n_samples = 190464?
Or is n_fft actually the fft_size, which would in fact correspond to 93 ms per window? Is the documentation wrong?
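For reference, the 93 ms figure follows from dividing the window length in samples by the sample rate, so it does not depend on the total length of the audio. A minimal sketch of that arithmetic in plain Python (the variable names are illustrative, not librosa API):

```python
# Duration of one analysis window, using the values quoted from the docs.
sr = 22050      # sample rate in Hz (the default mentioned in the quote)
n_fft = 2048    # window length in samples (the default n_fft)

window_seconds = n_fft / sr
window_ms = 1000 * window_seconds

print(f"{window_ms:.1f} ms per window")
```

The same division works for any other n_fft or sample rate.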
https://librosa.org/librosa/glossary.html indicates that n_fft is the window_size. But with 2.5 seconds of 22050 Hz audio, we get 109 frames from every feature extractor (except fourier_tempogram, which bizarrely has 110 frames), using the default parameters for each one. That would correspond to a window_size of 505.7339 samples, which doesn’t make sense to us.
Expected behavior
The documentation should be updated to describe the behavior of n_fft assuming different audio lengths, or the parameter should be renamed to fft_size.
If I use 512 or 2048 with 2.533 sec of audio, I shouldn’t have 109 windows for both. But I do:
    if feat in [chroma_cqt, chroma_cens]:
        z = feat(x, window=512)
    elif feat in [rms, zero_crossing_rate]:
        z = feat(x, frame_length=512)
    elif feat in [tonnetz]:
        z = feat(x, chroma=chroma_cqt(x, window=512))
    else:
        z = feat(x, n_fft=512)
and I get 109 windows, same as if I change this to 2048. However, I expect the number of windows to change.
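The identical frame counts are consistent with the number of frames being governed by the hop length rather than by n_fft. The sketch below assumes the framing rule that librosa.stft appears to follow with center=True: the signal is padded by n_fft // 2 on each side, then one frame is taken every hop_length samples. Here frame_count is a hypothetical helper, not a librosa function, and the 1-second signal length is made up for illustration; real counts depend on the exact number of samples and on each extractor's padding settings.

```python
def frame_count(n_samples, n_fft, hop_length=512, center=True):
    """Estimate the number of STFT frames (hypothetical helper, not librosa API)."""
    if center:
        # Assumed behaviour: pad n_fft // 2 samples on both sides.
        n_samples = n_samples + 2 * (n_fft // 2)
    # One frame per hop, as long as a full window still fits.
    return 1 + (n_samples - n_fft) // hop_length


n_samples = 22050  # illustrative: 1 second at 22050 Hz

# With centering, the window length cancels out of the count.
for n_fft in (512, 2048):
    print(n_fft, frame_count(n_samples, n_fft))

# Changing hop_length, by contrast, changes the count.
print(frame_count(n_samples, 2048, hop_length=256))
```

Under this assumption, the padding added by centering cancels the window length, which is why only hop_length moves the result.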
Software versions
Darwin-18.7.0-x86_64-i386-64bit
Python 3.7.7 (default, Mar 10 2020, 15:43:03) [Clang 11.0.0 (clang-1100.0.33.17)]
NumPy 1.18.5
SciPy 1.4.1
librosa 0.7.2
INSTALLED VERSIONS
python: 3.7.7 (default, Mar 10 2020, 15:43:03) [Clang 11.0.0 (clang-1100.0.33.17)]
librosa: 0.7.2
audioread: 2.1.8
numpy: 1.18.5
scipy: 1.4.1
sklearn: 0.22.2.post1
joblib: 0.15.1
decorator: 4.3.0
six: 1.15.0
soundfile: 0.10.2
resampy: 0.2.2
numba: 0.43.0
numpydoc: None
sphinx: None
sphinx_rtd_theme: None
sphinxcontrib.versioning: None
sphinx-gallery: None
pytest: None
pytest-mpl: None
pytest-cov: None
matplotlib: 3.2.1
presets: None
Additional context
We want to configure all of the feature extractors listed at https://librosa.org/librosa/feature.html to extract 100 ms windows.
Top GitHub Comments
Okay thank you, I understand.
So what I want to adjust is hop_size. And when it comes to window:
“Smaller values improve the temporal resolution of the STFT (i.e. the ability to discriminate impulses that are closely spaced in time) at the expense of frequency resolution (i.e. the ability to discriminate pure tones that are closely spaced in frequency). This effect is known as the time-frequency localization tradeoff and needs to be adjusted according to the properties of the input signal y.”
I understand about the tempo features.
By the way, all of the spectral features except tonnetz accept hop_length. Why can tonnetz not use this parameter? It seems to be fixed at 512.
One more point: CQT features (e.g. chroma_cqt) do not use fixed window lengths, but adapt the window length for each analysis frequency. You can of course change the hop length though, and this is what actually controls the output frame rate.
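To illustrate the point about hop length and output frame rate, here is a small sketch that converts a target frame spacing in milliseconds into a hop_length in samples. ms_to_samples is a hypothetical helper, not part of librosa; the 100 ms target comes from the issue, and some extractors may impose extra constraints on hop_length, so treat the result as a starting value rather than a guaranteed setting.

```python
def ms_to_samples(duration_ms, sr=22050):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sr / 1000))


sr = 22050
hop_length = ms_to_samples(100, sr)   # one frame roughly every 100 ms
frame_rate = sr / hop_length          # output frames per second

print(hop_length, frame_rate)
```

The resulting value could then be passed as hop_length to the extractors that accept that parameter.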
We’re not changing the parameter name.