Which library is torchaudio consistent with?
Hi, I’m currently updating my torch codebase from librosa to torchaudio for transforms, to take advantage of the (much) faster torch STFT implementation on the GPU. However, I’m running into several cases where Spectrogram vs. librosa.core.spectrum._spectrogram and MelSpectrogram vs. librosa.feature.melspectrogram give different results. Does this repo ensure consistency with another Python audio library for those transformations? I think it would be good to be consistent with another widely used library. I’m currently figuring out the correct params to ensure consistency, and I can PR something if that sounds useful.
For example:
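import librosa
import numpy as np
import torch
import torchaudio

sound, sample_rate = torchaudio.load('wav_file.wav')
soundcuda = sound.cuda()  # GPU copy for torch.stft
sound_librosa = sound.cpu().numpy().squeeze().T

sample_rate = 16000  # assumes the file is sampled at 16 kHz
n_mels = 40
window_stride = 0.01  # seconds
window_size = 0.025   # seconds
hop_length = int(sample_rate * window_stride)  # 160 samples
n_fft = int(sample_rate * window_size)         # 400 samples

# librosa reference: STFT, power spectrogram, and mel projection
stft_librosa = librosa.stft(y=sound_librosa,
                            hop_length=hop_length,
                            n_fft=n_fft)
spectro_librosa, n_fft = librosa.core.spectrum._spectrogram(y=sound_librosa,
                                                            hop_length=hop_length,
                                                            n_fft=n_fft, power=2)
mel_basis = librosa.filters.mel(sr=sample_rate,
                                n_mels=n_mels,
                                n_fft=n_fft,
                                norm=None,  # non-standard
                                htk=True)   # non-standard
check = np.dot(mel_basis, spectro_librosa)

# torch / torchaudio equivalent
window = torch.hann_window(n_fft, device=soundcuda.device)  # assumed: Hann, librosa's default window
stft_torch = torch.stft(soundcuda,
                        hop_length=hop_length,
                        n_fft=n_fft,
                        window=window).transpose(1, 2)
spectro_torch = stft_torch.pow(2).sum(-1)  # power: real^2 + imag^2 over the last dimension
melscale = torchaudio.transforms.MelScale(n_mels=n_mels)
check2 = melscale(spectro_torch)
# check == check2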
The torchaudio MelScale corresponds to calling librosa.filters.mel with the non-default librosa options norm=None and htk=True (https://librosa.github.io/librosa/_modules/librosa/filters.html#mel). I also removed the default spectrogram normalization at https://github.com/pytorch/audio/blob/master/torchaudio/transforms.py#L198, which is not a librosa option.
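To make the two options concrete, here is a minimal pure-Python sketch of a triangular mel filterbank built with the HTK mel formula and no per-filter normalization, which is what htk=True and norm=None select. The function names and defaults are illustrative assumptions; this is not code from librosa or torchaudio, and either library may differ in details.

```python
import math


def hz_to_mel_htk(freq_hz):
    """HTK mel formula: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of the HTK mel formula."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(sample_rate, n_fft, n_mels, f_min=0.0, f_max=None):
    """Return n_mels rows of n_fft // 2 + 1 triangular weights.

    No area normalization is applied, so each triangle peaks at (at most) 1.
    """
    if f_max is None:
        f_max = sample_rate / 2.0
    n_freqs = n_fft // 2 + 1
    # Centre frequency of every FFT bin, in Hz.
    fft_freqs = [k * sample_rate / n_fft for k in range(n_freqs)]
    # n_mels + 2 band edges, equally spaced on the mel axis.
    mel_lo, mel_hi = hz_to_mel_htk(f_min), hz_to_mel_htk(f_max)
    step = (mel_hi - mel_lo) / (n_mels + 1)
    edges = [mel_to_hz_htk(mel_lo + i * step) for i in range(n_mels + 2)]

    bank = []
    for m in range(n_mels):
        lower, centre, upper = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in fft_freqs:
            rising = (f - lower) / (centre - lower)
            falling = (upper - f) / (upper - centre)
            # Triangle: the smaller of the two slopes, clipped at zero.
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


# Same sizes as the example above: 16 kHz audio, 25 ms window, 40 mel bands.
bank = mel_filterbank(sample_rate=16000, n_fft=400, n_mels=40)
```

Multiplying this matrix with a power spectrogram of shape (n_fft // 2 + 1, frames) gives the mel spectrogram, which is the np.dot(mel_basis, spectro_librosa) step in the example.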
There are also functional inconsistencies between the librosa and torchaudio function calls: librosa.feature.melspectrogram returns a spectrogram, whereas torchaudio converts the spectrogram to the dB scale.
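For reference, a dB conversion of this kind is essentially 10 * log10 of the power relative to a reference value. A minimal sketch in plain Python for a single value follows; the function name and the ref and amin parameters are illustrative assumptions, not the signature of either library.

```python
import math


def power_to_db(power, ref=1.0, amin=1e-10):
    """Convert one power value to decibels relative to `ref`.

    `amin` is a small floor that avoids taking log10 of zero.
    """
    return 10.0 * math.log10(max(power, amin) / max(ref, amin))


# A power ratio of 100 corresponds to 20 dB; equal power is 0 dB.
db_equal = power_to_db(1.0)
db_hundred = power_to_db(100.0)
```

Applying the same function to every element of a mel power spectrogram gives a dB-scaled one, so comparing the two libraries requires applying this step on both sides or on neither.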
Thanks! The torch.stft implementation is consistent with librosa defaults if you use the same window type; I tested that first.
I’m interested in the transforms.py function signatures being consistent and any transforms being visible/configurable (e.g. the spectrogram normalization), but I didn’t know if these functions were modeled on some library. Personally, I think it makes sense to mirror librosa functionality for functions like MelSpectrogram, since librosa seems to be the most popular Python audio library. What do you think? I’m sure the current torchaudio default implementations also make sense for some applications. Happy to contribute if we want to make torchaudio consistent with librosa output for those common speech-to-text transforms.
Closing this issue, since the PR got merged. Please feel free to re-open 😃