[DeepSpeech >= 0.5.x] C++ MFCC implementation

tl;dr

It looks like the gradients for the C++ implementation need to be registered with the @tf.RegisterGradient("Mfcc") decorator, so someone with experience in gradient registration might be needed here (that’s not me…).

From what I can tell, this looks like what is needed:

Either, with gradient registration in place, simply import and use the C++ MFCC implementation:


from util.feeding import samples_to_mfccs

...

features, features_len = samples_to_mfccs(audio)

# The decorator is tf.RegisterGradient (singular); "Mfcc" is the op type name.
@tf.RegisterGradient("Mfcc")
def grads(...):
    ...
    return ...

...

(Or approximate the feature extraction with a tf.signal implementation or with the existing implementation, neither of which currently matches the outputs of the C++ implementation.)
Make a minor change to tf.logits.py so that it calls create_model, which also handles the windowing for us via the overlap keyword argument:

DeepSpeech.create_model(features, features_len, [0]*10, overlap=True)

Further details

DeepSpeech >= v0.5.x now uses the C++-defined implementation tensorflow.contrib.framework.python.ops.audio_ops.mfcc() for feature extraction, together with a custom windowing function (create_overlapping_windows in DeepSpeech.py).
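To make the windowing step concrete, here is a minimal pure-Python sketch of overlapping context windows. It is not the DeepSpeech code: the function name overlapping_windows and the n_context parameter are hypothetical, and it assumes each time step gets 2 * n_context + 1 frames centred on it, with zero frames padded at both edges.

```python
def overlapping_windows(frames, n_context):
    """For each time step t, collect frames t - n_context .. t + n_context.

    Assumption: the edges are padded with all-zero frames, so every
    window has exactly 2 * n_context + 1 frames.
    """
    n_features = len(frames[0])
    zero = [0.0] * n_features
    # Pad n_context zero frames on each side of the sequence.
    padded = [zero] * n_context + list(frames) + [zero] * n_context
    window = 2 * n_context + 1
    return [padded[t:t + window] for t in range(len(frames))]


# Four time steps with two features each (made-up values).
frames = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5], [4.0, 4.5]]
windows = overlapping_windows(frames, n_context=1)

for t, w in enumerate(windows):
    print(t, w)
```

The output has one window per input time step, so the sequence length is unchanged and only the per-step feature context grows.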

The resulting Python code for the C++ implementations of both mfcc() and audio_spectrogram() is built into this file during install: ... /tensorflow/contrib/framework/python/ops/gen_audio_ops.py.

The definitions for their settings are located here

It seems that the mfcc and audio_spectrogram functions will register gradients with the graph when running eagerly (that’s what it looks like from the code; I have not tested this…). However, I have tried running on a session basis, and the gradients do not get registered.

I’ve run some tests comparing tf.signal (GPU, with gradients registered) against the C++ implementation. Using the example here (but with spectrograms = tf.square(tf.abs(stfts)) and the C++ parameters defined in the file linked above), I get results that differ from those of the C++ implementation.
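For readers unfamiliar with the tf.square(tf.abs(stfts)) step, it turns complex STFT coefficients into a power spectrogram. The sketch below shows the same computation for a single frame using only the standard library. It is an illustration, not the C++ op or tf.signal: the helper name power_spectrum is hypothetical, no window function is applied, and a naive DFT stands in for the FFT.

```python
import cmath
import math


def power_spectrum(frame):
    """Squared magnitude of the one-sided DFT of a real-valued frame."""
    n = len(frame)
    bins = []
    for k in range(n // 2 + 1):  # one-sided: bins 0 .. n/2
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        bins.append(abs(coeff) ** 2)  # same role as tf.square(tf.abs(...))
    return bins


# One period of a cosine over 8 samples: all energy should land in bin 1.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
power = power_spectrum(frame)
print(power)
```

For this test frame the energy concentrates in bin 1 with power (8 / 2) ** 2 = 16, and the remaining bins are zero up to floating-point error.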

Example output Tensor for mfcc[0][10][:26] against sample-000000.wav, where A = C++ and B = tf.signal:

A tf.Tensor(
[-9.541672    0.7038873  -0.01703602  2.4102588   3.699989    2.9941733
  0.05728421 -0.39552015  0.41694957  0.1773318  -0.94534826 -0.07476918
 -0.72920483  0.04988448  0.03030526  0.70554906 -1.4502583   0.25273177
  0.06876066 -0.79810566  0.11811491 -0.5782758  -0.34861287  0.8405349
  0.19752231 -0.07848166], shape=(26,), dtype=float32)
B tf.Tensor(
[-16.386528     3.151216    -0.13298927   4.371477     6.127008
   5.412611     0.2834479   -0.65268487   0.40095583   0.2454046
  -0.5782857   -0.423165    -0.8718532   -0.24084271   0.44111174
   1.492601    -1.9069418    0.2948684   -0.45236063  -0.5574946
   0.21208026  -0.6371893   -0.29641697   0.44038427   0.4952675
  -0.30927685], shape=(26,), dtype=float32)

Example output Tensor for spectrogram[0][10][:26] against sample-000000.wav, where A = C++ and B = tf.signal. Note the minor differences, which come down to CPU vs. GPU floating point. Any implementation using tf.signal would likely be an approximation of the way DeepSpeech does it. Whether that’s close enough, I’m not sure. I would need to run some tests and have a think…

A tf.Tensor(
[2.5844438e-02 1.4196590e+01 7.8703909e+00 1.4777693e+00 9.0667505e+00
 1.7904934e+00 4.6659380e-02 1.6692156e-01 2.9524408e-02 1.2653788e-02
 1.4637280e-02 5.9227698e-04 1.2959058e-03 8.3363376e-04 2.5570781e-03
 4.5909737e-03 2.1686291e-03 5.4116650e-03 8.5209711e-03 1.0986484e-03
 9.5019070e-03 9.1623375e-03 5.7589263e-03 2.1361437e-02 3.0068232e-02
 1.6542105e-02], shape=(26,), dtype=float32)
B tf.Tensor(
[2.58445051e-02 1.41965885e+01 7.87038898e+00 1.47776914e+00
 9.06674957e+00 1.79049325e+00 4.66593951e-02 1.66921541e-01
 2.95244064e-02 1.26537923e-02 1.46372588e-02 5.92276279e-04
 1.29591115e-03 8.33630445e-04 2.55708303e-03 4.59098117e-03
 2.16862885e-03 5.41166542e-03 8.52095429e-03 1.09864434e-03
 9.50190984e-03 9.16234031e-03 5.75892068e-03 2.13614274e-02
 3.00682262e-02 1.65421031e-02], shape=(26,), dtype=float32)
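
To put rough numbers on the two comparisons above, the following sketch computes the largest relative difference between the A (C++) and B (tf.signal) values. The spectrogram lists are copied from the output above; for the MFCCs only the first five coefficients are used. The helper name max_relative_diff and the variable names are made up for this illustration.

```python
def max_relative_diff(a, b):
    """Largest |a_i - b_i| / |a_i| over paired values."""
    return max(abs(x - y) / abs(x) for x, y in zip(a, b))


# Spectrogram values, A = C++ and B = tf.signal (copied from above).
spec_cpp = [
    2.5844438e-02, 1.4196590e+01, 7.8703909e+00, 1.4777693e+00, 9.0667505e+00,
    1.7904934e+00, 4.6659380e-02, 1.6692156e-01, 2.9524408e-02, 1.2653788e-02,
    1.4637280e-02, 5.9227698e-04, 1.2959058e-03, 8.3363376e-04, 2.5570781e-03,
    4.5909737e-03, 2.1686291e-03, 5.4116650e-03, 8.5209711e-03, 1.0986484e-03,
    9.5019070e-03, 9.1623375e-03, 5.7589263e-03, 2.1361437e-02, 3.0068232e-02,
    1.6542105e-02,
]
spec_tf = [
    2.58445051e-02, 1.41965885e+01, 7.87038898e+00, 1.47776914e+00,
    9.06674957e+00, 1.79049325e+00, 4.66593951e-02, 1.66921541e-01,
    2.95244064e-02, 1.26537923e-02, 1.46372588e-02, 5.92276279e-04,
    1.29591115e-03, 8.33630445e-04, 2.55708303e-03, 4.59098117e-03,
    2.16862885e-03, 5.41166542e-03, 8.52095429e-03, 1.09864434e-03,
    9.50190984e-03, 9.16234031e-03, 5.75892068e-03, 2.13614274e-02,
    3.00682262e-02, 1.65421031e-02,
]

# First five MFCC coefficients only, A = C++ and B = tf.signal.
mfcc_cpp = [-9.541672, 0.7038873, -0.01703602, 2.4102588, 3.699989]
mfcc_tf = [-16.386528, 3.151216, -0.13298927, 4.371477, 6.127008]

spec_diff = max_relative_diff(spec_cpp, spec_tf)
mfcc_diff = max_relative_diff(mfcc_cpp, mfcc_tf)
print("spectrogram max relative diff:", spec_diff)
print("mfcc max relative diff:", mfcc_diff)
```

With these values the spectrogram discrepancy stays below 1e-5 in relative terms, which looks consistent with float32 rounding, while the MFCC coefficients differ by well over 50%. That suggests the mismatch is introduced after the spectrogram stage, though this check alone does not say where.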