
Why are features at STFT based upon n_fft and not fft_size?


Describe the bug

https://librosa.org/librosa/generated/librosa.core.stft.html

“n_fft : int > 0 [scalar]: length of the windowed signal after padding with zeros. The number of rows in the STFT matrix D is (1 + n_fft/2). The default value, n_fft=2048 samples, corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the default sample rate in librosa.”

How can n_fft=2048 samples correspond to a physical duration of 93 milliseconds at a sample rate of 22050 Hz, unless you assume n_samples = 190464?

Or is n_fft actually the fft_size, which would in fact correspond to 93 ms per window? Is the documentation wrong?
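For reference, the 93 ms figure follows from dividing the window length in samples by the sample rate. A quick check of that arithmetic (the helper name `window_duration_ms` is made up for this illustration and is not part of librosa):

```python
def window_duration_ms(n_fft, sr):
    """Duration of one analysis window of n_fft samples, in milliseconds."""
    return 1000.0 * n_fft / sr


# Default values quoted in the documentation excerpt above.
print(round(window_duration_ms(2048, 22050), 2))  # 92.88
```

Read this way, n_fft describes the length of each individual window, not the total length of the signal.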

https://librosa.org/librosa/glossary.html indicates that n_fft is the window_size. But with 2.5 seconds of 22050 Hz audio, we get 109 frames from every feature extractor (except fourier_tempogram, which bizarrely has 110 frames), using the default params for each one. That corresponds to a window_size of 505.7339, which doesn’t make sense to us.

Expected behavior

The documentation should be updated to describe the behavior of n_fft assuming different audio lengths, or the parameter should be renamed to fft_size.

If I use 512 or 2048 with 2.533 sec of audio, I shouldn’t have 109 windows for both. But I do:

        if feat in [chroma_cqt, chroma_cens]:
            z = feat(x, window=512)
        elif feat in [rms, zero_crossing_rate]:
            z = feat(x, frame_length=512)
        elif feat in [tonnetz]:
            z = feat(x, chroma=chroma_cqt(x, window=512))
        else:
            z = feat(x, n_fft=512)

and I get 109 windows, same as if I change this to 2048. However, I expect the number of windows to change.
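The identical frame counts are consistent with how centered framing works. Assuming the default center=True and an even n_fft, the signal is padded by n_fft // 2 samples on each side, so the window length cancels out of the frame count and only the hop length remains. Below is a minimal sketch of that arithmetic, not librosa's actual code; the function name and the 55500-sample signal length are assumptions chosen for illustration:

```python
def n_frames_centered(n_samples, n_fft, hop_length=512):
    """Frame count for centered framing (assumes center=True, even n_fft)."""
    # Centering pads n_fft // 2 samples on each side of the signal.
    padded = n_samples + 2 * (n_fft // 2)
    # One frame at the start, then one more for every full hop that fits.
    return 1 + (padded - n_fft) // hop_length


n_samples = 55500  # hypothetical length, roughly 2.5 s at 22050 Hz

print(n_frames_centered(n_samples, n_fft=512))   # 109
print(n_frames_centered(n_samples, n_fft=2048))  # 109
print(n_frames_centered(n_samples, n_fft=2048, hop_length=256))  # 217
```

Under these assumptions the count reduces to 1 + n_samples // hop_length, which is why changing the window from 512 to 2048 leaves it at 109 while halving the hop roughly doubles it.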

Software versions

Darwin-18.7.0-x86_64-i386-64bit
Python 3.7.7 (default, Mar 10 2020, 15:43:03) [Clang 11.0.0 (clang-1100.0.33.17)]
NumPy 1.18.5
SciPy 1.4.1
librosa 0.7.2

INSTALLED VERSIONS
python: 3.7.7 (default, Mar 10 2020, 15:43:03) [Clang 11.0.0 (clang-1100.0.33.17)]

librosa: 0.7.2

audioread: 2.1.8
numpy: 1.18.5
scipy: 1.4.1
sklearn: 0.22.2.post1
joblib: 0.15.1
decorator: 4.3.0
six: 1.15.0
soundfile: 0.10.2
resampy: 0.2.2
numba: 0.43.0

numpydoc: None
sphinx: None
sphinx_rtd_theme: None
sphinxcontrib.versioning: None
sphinx-gallery: None
pytest: None
pytest-mpl: None
pytest-cov: None
matplotlib: 3.2.1
presets: None

Additional context

We want to change all https://librosa.org/librosa/feature.html feature extractors to extract 100ms windows.


Top GitHub Comments

1 reaction

turian commented, Jun 24, 2020

Okay thank you, I understand.

So what I want to adjust is hop_length. And when it comes to the window:

“Smaller values improve the temporal resolution of the STFT (i.e. the ability to discriminate impulses that are closely spaced in time) at the expense of frequency resolution (i.e. the ability to discriminate pure tones that are closely spaced in frequency). This effect is known as the time-frequency localization tradeoff and needs to be adjusted according to the properties of the input signal y.”

I understand about the tempo features.

By the way, all of the spectral features except tonnetz have hop_length. Why can tonnetz not use this parameter? It seems to be fixed at 512.

1 reaction

bmcfee commented, Jun 22, 2020

One more point: CQT features (e.g. chroma_cqt) do not use fixed window lengths, but adapt the window length for each analysis frequency. You can of course change the hop length though, and this is what actually controls the output frame rate.

“or changed to fft_size.”

We’re not changing the parameter name.
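To tie this back to the original goal of 100 ms frames: a duration can be converted to a sample count and used as hop_length (which sets the output frame rate) and, for STFT-based features, as n_fft (which sets the window length). Here is a small sketch of that conversion, assuming the default 22050 Hz sample rate; the helper names are hypothetical and not part of librosa:

```python
def ms_to_samples(duration_ms, sr=22050):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms / 1000.0 * sr))


def frames_per_second(hop_length, sr=22050):
    """Output frame rate implied by a given hop length."""
    return sr / hop_length


hop_100ms = ms_to_samples(100)                 # 2205 samples
print(hop_100ms)                               # 2205
print(frames_per_second(hop_100ms))            # 10.0
print(round(frames_per_second(512), 2))        # 43.07 (default hop)
```

The resulting value could then be passed as hop_length=2205 to the extractors that accept it. Whether n_fft should also be set to 2205 is a separate choice, since a length that is not a power of two may make the FFT slower.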
