[DeepSpeech >= 0.5.x] C++ MFCC implementation
See original GitHub issue
tl;dr:
It looks like gradients for the C++ implementation need to be registered with the @tf.RegisterGradient("Mfcc") decorator, so someone who has experience with gradient registration might be needed here (that's not me…).
From what I can tell, this is what's needed:
- Either register the gradient and simply import & use the C++ MFCC implementation:

import tensorflow as tf
from util.feeding import samples_to_mfccs
...
features, features_len = samples_to_mfccs(audio)

@tf.RegisterGradient("Mfcc")
def _mfcc_grad(op, grad):
    # op: the forward Mfcc op; grad: incoming gradient w.r.t. its output
    ...
    return ...
...
- Or estimate the feature extraction with a tf.signal implementation or the existing implementation (neither of which currently matches the C++ implementation's outputs); see the sketch after this list.
- Minor change to tf_logits.py to call create_model, which also handles the windowing for us via the overlap keyword arg:

DeepSpeech.create_model(features, features_len, [0]*10, overlap=True)
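For the tf.signal route, here is a minimal sketch of what the estimation could look like (assuming TF >= 1.14). The window, stride and filterbank values are my assumptions meant to mirror the C++ op defaults, not values taken from DeepSpeech's config, and as noted above the output will not match the C++ implementation exactly:

import tensorflow as tf

def samples_to_mfccs_signal(samples, sample_rate=16000):
    # samples: float32 tensor shaped [..., num_samples]
    # Hypothetical helper; all numeric parameters below are assumptions.
    stfts = tf.signal.stft(samples, frame_length=512, frame_step=320, fft_length=512)
    spectrograms = tf.square(tf.abs(stfts))           # power spectrogram
    num_spectrogram_bins = 512 // 2 + 1               # fft_length // 2 + 1
    linear_to_mel = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=40, num_spectrogram_bins=num_spectrogram_bins,
        sample_rate=sample_rate, lower_edge_hertz=20.0, upper_edge_hertz=4000.0)
    mel_spectrograms = tf.tensordot(spectrograms, linear_to_mel, 1)
    log_mel = tf.math.log(mel_spectrograms + 1e-6)
    # Keep 26 coefficients to match the 26-value slices shown further down.
    return tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :26]

The upside of this route is that every op in it already has a registered gradient, so tf.gradients works out of the box.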
Further details
DeepSpeech >= v0.5.x now uses the C++-defined implementation tensorflow.contrib.framework.python.ops.audio_ops.mfcc() for feature extraction, together with a custom windowing function (create_overlapping_windows in DeepSpeech.py).
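For reference, the call pattern into those C++ ops looks roughly like this; the window size, stride and coefficient count are illustrative assumptions rather than DeepSpeech's exact configuration:

import tensorflow as tf
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

# samples: [num_samples, 1] float32 audio in [-1, 1], e.g. from contrib_audio.decode_wav()
samples = tf.placeholder(tf.float32, [None, 1])
spectrogram = contrib_audio.audio_spectrogram(samples, window_size=512, stride=320,
                                              magnitude_squared=True)
mfccs = contrib_audio.mfcc(spectrogram=spectrogram, sample_rate=16000,
                           dct_coefficient_count=26)

Both ops run fine in the forward pass; the problem described here is only the missing gradient registration for the backward pass.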
The resulting Python wrappers for the C++ implementations of both mfcc() and audio_spectrogram() are generated into this file during install: ... /tensorflow/contrib/framework/python/ops/gen_audio_ops.py.
The definitions for their settings are located here.
It seems that the mfcc and audio_spectrogram functions will register gradients with the graph when running eagerly (that's what it looks like from the code; I have not tested this…). However, I have tried running on a session basis and the gradients do not get registered.
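A quick way to reproduce the session-mode symptom is to ask for a gradient through the op and watch it fail. A minimal sketch, again with assumed parameter values:

import tensorflow as tf
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

samples = tf.placeholder(tf.float32, [None, 1])
spectrogram = contrib_audio.audio_spectrogram(samples, window_size=512, stride=320,
                                              magnitude_squared=True)
mfccs = contrib_audio.mfcc(spectrogram=spectrogram, sample_rate=16000,
                           dct_coefficient_count=26)
try:
    tf.gradients(mfccs, samples)
except LookupError as e:
    # With no gradient registered for "Mfcc" this raises something like
    # "No gradient defined for operation ... (op type: Mfcc)"; exact wording may vary.
    print(e)

Once a gradient function is registered via @tf.RegisterGradient("Mfcc") (and one for "AudioSpectrogram", if gradients all the way back to the samples are needed), the same tf.gradients call should succeed.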
I've done some tests comparing tf.signal (GPU, gradient-registered) methods against the C++ implementation. Using the example here (but with spectrograms = tf.square(tf.abs(stfts)) and the C++ parameters defined in the file linked above) I get different results from the C++ implementation.
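For context, a rough sketch of the kind of side-by-side comparison behind the slices below; eager execution, the 512/320 window parameters and reading the wav with contrib_audio.decode_wav are all my assumptions, not necessarily how the numbers below were produced:

import tensorflow as tf
tf.enable_eager_execution()  # assumption: lets the tensors print directly
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

wav = tf.read_file('sample-000000.wav')
samples, _ = contrib_audio.decode_wav(wav, desired_channels=1)

# A: the C++ op (power spectrogram because magnitude_squared=True)
spec_a = contrib_audio.audio_spectrogram(samples, window_size=512, stride=320,
                                         magnitude_squared=True)
# B: the tf.signal estimate of the same thing
stfts = tf.signal.stft(tf.reshape(samples, [1, -1]),
                       frame_length=512, frame_step=320, fft_length=512)
spec_b = tf.square(tf.abs(stfts))

print('A', spec_a[0][10][:26])
print('B', spec_b[0][10][:26])

The MFCC comparison is the same idea, with contrib_audio.mfcc() on side A and a tf.signal pipeline like the sketch further up on side B.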
Example output tensor for mfcc[0][10][:26], where A = C++ and B = tf.signal, against sample-000000.wav:
A tf.Tensor(
[-9.541672 0.7038873 -0.01703602 2.4102588 3.699989 2.9941733
0.05728421 -0.39552015 0.41694957 0.1773318 -0.94534826 -0.07476918
-0.72920483 0.04988448 0.03030526 0.70554906 -1.4502583 0.25273177
0.06876066 -0.79810566 0.11811491 -0.5782758 -0.34861287 0.8405349
0.19752231 -0.07848166], shape=(26,), dtype=float32)
B tf.Tensor(
[-16.386528 3.151216 -0.13298927 4.371477 6.127008
5.412611 0.2834479 -0.65268487 0.40095583 0.2454046
-0.5782857 -0.423165 -0.8718532 -0.24084271 0.44111174
1.492601 -1.9069418 0.2948684 -0.45236063 -0.5574946
0.21208026 -0.6371893 -0.29641697 0.44038427 0.4952675
-0.30927685], shape=(26,), dtype=float32)
Example output tensor for spectrogram[0][10][:26], where A = C++ and B = tf.signal, against sample-000000.wav:
Note the minor differences, likely CPU vs GPU floating point. Any implementation using tf.signal would likely be an approximation of the way DeepSpeech does it. Whether that's close enough, I'm not sure; I would need to run some tests / have a think…
A tf.Tensor(
[2.5844438e-02 1.4196590e+01 7.8703909e+00 1.4777693e+00 9.0667505e+00
1.7904934e+00 4.6659380e-02 1.6692156e-01 2.9524408e-02 1.2653788e-02
1.4637280e-02 5.9227698e-04 1.2959058e-03 8.3363376e-04 2.5570781e-03
4.5909737e-03 2.1686291e-03 5.4116650e-03 8.5209711e-03 1.0986484e-03
9.5019070e-03 9.1623375e-03 5.7589263e-03 2.1361437e-02 3.0068232e-02
1.6542105e-02], shape=(26,), dtype=float32)
B tf.Tensor(
[2.58445051e-02 1.41965885e+01 7.87038898e+00 1.47776914e+00
9.06674957e+00 1.79049325e+00 4.66593951e-02 1.66921541e-01
2.95244064e-02 1.26537923e-02 1.46372588e-02 5.92276279e-04
1.29591115e-03 8.33630445e-04 2.55708303e-03 4.59098117e-03
2.16862885e-03 5.41166542e-03 8.52095429e-03 1.09864434e-03
9.50190984e-03 9.16234031e-03 5.75892068e-03 2.13614274e-02
3.00682262e-02 1.65421031e-02], shape=(26,), dtype=float32)
Top GitHub Comments
Have you had any success in rewriting the code for DS 0.5.x?
(misclick)