Explain difference to torch.stft
You mentioned in the readme that
Other GPU audio processing tools are torchaudio and tf.signal. But they are not using the neural network approach, and hence the Fourier basis can not be trained.
Can you explain this in more detail, please?
- When would I benefit from the STFT in nnAudio compared to, let’s say, torch.stft?
- Does it make a difference which STFT I use when I am interested in a time domain loss, that is, does it change backprop?
Thanks!
Issue Analytics
- State:
- Created 4 years ago
- Comments:14 (5 by maintainers)

@carlthome The speech processing community long ago decided that win_length=400 is a good choice for time resolution, so it’s basically standardized for speech recognition. You’re right that the FFT is most efficient when its length is a power of 2, and 512 is the closest power of 2 to 400. The Hann window is applied to the 400 input samples, the result is zero-padded to 512, and then the FFT is taken. If you look at any speech processing pipeline, this is how it’s done (in Kaldi, ESPNet, etc.).
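To make the window-then-pad order concrete, here is a minimal NumPy sketch of a single frame. The 16 kHz sample rate and the 1 kHz test tone are my own assumptions for illustration, and np.hanning is only a stand-in for whatever exact window variant Kaldi or ESPNet use.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed; 400 samples is 25 ms at this rate
WIN_LENGTH = 400     # analysis window length from the comment above
N_FFT = 512          # power of 2 that the windowed frame is padded to

# One frame of a 1 kHz test tone (illustrative input only).
t = np.arange(WIN_LENGTH) / SAMPLE_RATE
frame = np.sin(2 * np.pi * 1000.0 * t)

# 1) Apply the Hann window to the 400 input samples.
windowed = frame * np.hanning(WIN_LENGTH)

# 2) Zero-pad the windowed frame to 512 samples.
padded = np.pad(windowed, (0, N_FFT - WIN_LENGTH))

# 3) Take the FFT; a real input gives N_FFT // 2 + 1 frequency bins.
spectrum = np.fft.rfft(padded)

# Passing n=N_FFT to rfft performs the same zero-padding implicitly.
assert np.allclose(spectrum, np.fft.rfft(windowed, n=N_FFT))

peak_bin = int(np.argmax(np.abs(spectrum)))
peak_hz = peak_bin * SAMPLE_RATE / N_FFT
```

The padding does not add time resolution; it only samples the spectrum of the same 400-sample window on a finer frequency grid.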
Thanks for the interest in nnAudio. nnAudio is still in the early stage of development. If you find any bugs or problems, feel free to ask here again, and I will try my best to solve them and improve nnAudio.
torch.stft is a really good option, since unlike torchaudio it requires no extra dependency. What makes nnAudio different from torch.stft is the trainable STFT.
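As a rough illustration of what "trainable" means here, the sketch below builds an STFT as a 1-D convolution whose kernels are the windowed Fourier basis stored as parameters. This is not nnAudio's actual code: the class name TrainableSTFT and its arguments are hypothetical, and it only assumes the general neural-network approach the readme describes.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class TrainableSTFT(nn.Module):
    """Hypothetical sketch: magnitude STFT as a conv1d with learnable kernels."""

    def __init__(self, n_fft=512, hop_length=128, trainable=True):
        super().__init__()
        self.hop_length = hop_length
        n_bins = n_fft // 2 + 1
        n = torch.arange(n_fft)
        k = torch.arange(n_bins).unsqueeze(1)
        # Reduce k*n modulo n_fft in integers to keep the angles accurate.
        angle = 2 * math.pi * ((k * n) % n_fft).double() / n_fft
        window = torch.hann_window(n_fft, dtype=torch.float64)
        real = (torch.cos(angle) * window).float().unsqueeze(1)
        imag = (-torch.sin(angle) * window).float().unsqueeze(1)
        # Kernels have shape (n_bins, 1, n_fft); the optimizer can update them.
        self.kernel_real = nn.Parameter(real, requires_grad=trainable)
        self.kernel_imag = nn.Parameter(imag, requires_grad=trainable)

    def forward(self, x):
        # x: (batch, samples) -> magnitude spectrogram (batch, n_bins, frames)
        x = x.unsqueeze(1)
        real = F.conv1d(x, self.kernel_real, stride=self.hop_length)
        imag = F.conv1d(x, self.kernel_imag, stride=self.hop_length)
        return torch.sqrt(real ** 2 + imag ** 2 + 1e-8)


torch.manual_seed(0)
x = torch.randn(2, 4000)
layer = TrainableSTFT(n_fft=512, hop_length=128)
spec = layer(x)

# Reference: the fixed-basis transform from torch.stft with matching settings.
ref = torch.stft(
    x, n_fft=512, hop_length=128, window=torch.hann_window(512),
    center=False, return_complex=True,
).abs()

# Backprop reaches the Fourier kernels, so they can be fine-tuned.
spec.sum().backward()
```

At initialization the output matches torch.stft up to floating-point error. Gradients with respect to the input signal flow through both versions, so a time domain loss backpropagates either way; the difference is that here the basis itself also receives gradients.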