Problems with Kaldi MFCCs
Hi, thank you very much for this very useful project.
I started doing some speech recognition experiments with the MFCC features implemented in torchaudio. In particular, I tried the librosa ones implemented in torchaudio/transforms.py and the Kaldi ones implemented in torchaudio/compliance/kaldi.py.
- The librosa features are computed very efficiently, and I can achieve results similar to those of the original Kaldi features after changing some hyperparameters (i.e., n_mfcc=13, hop_length=160, n_mels=23, f_min=80, f_max=7900).
- When switching to the Kaldi features implemented in torchaudio, however, my neural network doesn't even converge. I suspect there is a bug somewhere. I compared the original Kaldi MFCCs with the ones implemented in torchaudio, and they look very different (dithering alone cannot explain such a big difference):
mfcc_original
array([35.84189 , 39.748493, 35.40782 , 33.237488, 34.53969 , 35.40782 ,
34.973755, 35.40782 , 35.40782 , 35.84189 ], dtype=float32)
mfcc_torch
tensor([29.3794, 29.1657, 28.7020, 27.4892, 29.1944, 27.8915, 29.3321, 28.8958, 28.4197,
29.0967])
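One quick way to characterize a mismatch like this is to check whether the two vectors differ only by a constant offset or also in shape. The sketch below does that on the ten values quoted above, using only the Python standard library. The helper name `offset_and_residual` is hypothetical, and the post does not say which coefficient or frames these values correspond to, so the numbers only describe this particular excerpt.

```python
from statistics import mean

# Values copied from the post above.
mfcc_original = [35.84189, 39.748493, 35.40782, 33.237488, 34.53969,
                 35.40782, 34.973755, 35.40782, 35.40782, 35.84189]
mfcc_torch = [29.3794, 29.1657, 28.7020, 27.4892, 29.1944,
              27.8915, 29.3321, 28.8958, 28.4197, 29.0967]


def offset_and_residual(reference, candidate):
    """Return (mean offset, largest abs difference after removing each mean).

    Hypothetical helper for illustration; not part of torchaudio or Kaldi.
    """
    ref_mean, cand_mean = mean(reference), mean(candidate)
    offset = ref_mean - cand_mean
    residual = max(abs((r - ref_mean) - (c - cand_mean))
                   for r, c in zip(reference, candidate))
    return offset, residual


offset, residual = offset_and_residual(mfcc_original, mfcc_torch)
print(f"mean offset: {offset:.3f}")
print(f"largest difference after mean removal: {residual:.3f}")
```

On this excerpt the mean offset is a little under 7, and a difference of several units remains after removing the means, so a constant shift alone would not account for the gap between these two vectors.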
The other issue is that the current version doesn't support CUDA and can only process up to two channels at a time. Also, the current version is significantly slower than the librosa implementation (there could be a bottleneck somewhere).
Any ideas? I hope this feedback is helpful.
Thank you
Mirco
Issue Analytics
- Created 4 years ago
- Reactions:2
- Comments:7 (2 by maintainers)
Top GitHub Comments
Update: I went back to the spectrogram level to look for the bug. If I set the subtract_mean flag to true in both Kaldi and PyTorch, the resulting spectrograms are (almost) the same. But if I set it to false (which is the default), the results differ: they have the same pattern, but the mean is different.
Kaldi code to generate spectrograms with mean subtraction:
~/kaldi/src/featbin/compute-spectrogram-feats --subtract-mean=true --dither=0.0 --energy-floor=1.0 scp,p:wav.scp ark:generated_feats/spec.ark
PyTorch code to generate spectrograms with mean subtraction:
torch_feats = torchaudio_local.spectrogram(waveform=audio_tensor,dither=0,energy_floor=1.0,subtract_mean=True)
Result:
Kaldi code to generate spectrograms without mean subtraction:
~/kaldi/src/featbin/compute-spectrogram-feats --subtract-mean=false --dither=0.0 --energy-floor=1.0 scp,p:wav.scp ark:generated_feats/spec.ark
PyTorch code to generate spectrograms without mean subtraction:
torch_feats = torchaudio_local.spectrogram(waveform=audio_tensor,dither=0,energy_floor=1.0,subtract_mean=False)
Result:
I suspect that something in the FFT computation normalizes the signal in one implementation but not in the other. Any thoughts?
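The symptom described here (same pattern, different mean, and agreement once subtract_mean is enabled) is what a constant gain on the input samples would produce: scaling the waveform by g adds 2*ln(g) to every log-power bin, and per-bin mean subtraction removes that constant. The sketch below illustrates this with a naive DFT on toy noise frames, using only the standard library. The gain of 32768 is an assumption (samples in [-1, 1] in one pipeline versus the 16-bit integer range in the other); the thread does not confirm that this is the cause, and the helper names are hypothetical.

```python
import cmath
import math
import random


def log_power_frames(frames):
    """Naive DFT per frame; returns log power for bins 0..n/2."""
    result = []
    for frame in frames:
        n = len(frame)
        row = []
        for k in range(n // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(frame))
            row.append(math.log(abs(coeff) ** 2))
        result.append(row)
    return result


def subtract_mean(features):
    """Subtract the per-bin mean over frames (one mean per column)."""
    num_frames = len(features)
    means = [sum(col) / num_frames for col in zip(*features)]
    return [[v - m for v, m in zip(row, means)] for row in features]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two feature matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)
# Toy "float" audio in [-1, 1]: 4 frames of 32 noise samples.
frames_float = [[rng.uniform(-1.0, 1.0) for _ in range(32)] for _ in range(4)]
# Same audio rescaled to the 16-bit integer range (assumed gain).
GAIN = 32768.0
frames_int16 = [[GAIN * x for x in frame] for frame in frames_float]

spec_float = log_power_frames(frames_float)
spec_int16 = log_power_frames(frames_int16)

raw_gap = max_abs_diff(spec_float, spec_int16)
centered_gap = max_abs_diff(subtract_mean(spec_float), subtract_mean(spec_int16))

print(f"expected offset 2*ln(gain): {2 * math.log(GAIN):.4f}")
print(f"gap without mean subtraction: {raw_gap:.4f}")
print(f"gap with mean subtraction:    {centered_gap:.2e}")
```

If the real features behave the same way, the difference between the two non-mean-subtracted spectrograms should be nearly constant across bins and frames, which is easy to check by subtracting one from the other.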
Here is my example of 3 and 4: Kaldi are tensor1, torch.kaldi are tensor2, torch.kaldi again are tensor3. All different.