
[DeepSpeech >= 0.5.x] C++ MFCC implementation

See original GitHub issue


It looks like the C++ implementation's gradients need to be registered with the @tf.RegisterGradient("Mfcc") decorator, so someone with experience in gradient registration may be needed here (that's not me…).

From what I can tell, this looks like what is needed:

  • Either register the gradients and simply import & use the C++ MFCC implementation:

import tensorflow as tf
from util.feeding import samples_to_mfccs

features, features_len = samples_to_mfccs(audio)

# Register a gradient for the C++ "Mfcc" op (the (op, grad) signature is what
# tf.RegisterGradient expects; the gradient body itself still needs working out)
@tf.RegisterGradient("Mfcc")
def grads(op, grad):
    return ....


  • (Or estimate the feature extraction using a tf.signal implementation or the existing implementation, neither of which currently matches the C++ implementation's outputs.)
  • Plus a minor change to the create_model call so that it also handles the windowing for us via the overlap keyword arg:
DeepSpeech.create_model(features, features_len, [0]*10, overlap=True)

Further details

DeepSpeech >= v0.5.x now uses the C++-defined implementation tensorflow.contrib.framework.python.ops.audio_ops.mfcc() for feature extraction, together with a custom windowing function (create_overlapping_windows).
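For intuition, create_overlapping_windows splits the audio into overlapping frames before the spectrogram op. The function name comes from DeepSpeech, but the parameter names and framing logic below are my assumptions — a minimal NumPy sketch of the idea:

```python
import numpy as np

def frame_signal(samples, window_size, window_step):
    """Split a 1-D signal into overlapping frames.

    A sketch of what a windowing helper like create_overlapping_windows
    does conceptually; parameter names here are assumptions, not the
    actual DeepSpeech API.
    """
    num_frames = 1 + (len(samples) - window_size) // window_step
    # Row i contains samples [i*step, i*step + window_size)
    idx = (np.arange(window_size)[None, :]
           + window_step * np.arange(num_frames)[:, None])
    return samples[idx]

signal = np.arange(10, dtype=np.float32)
frames = frame_signal(signal, window_size=4, window_step=2)
# frames.shape == (4, 4): rows start at samples 0, 2, 4, 6
```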

The resulting Python code for the C++ implementations of both mfcc() and audio_spectrogram() is generated into this file during install: ... /tensorflow/contrib/framework/python/ops/

The definitions for their settings are located here

It seems that the mfcc and audio_spectrogram functions will register gradients with the graph when running eagerly (that's what it looks like from the code; I have not tested this…). However, I have tried running on a session basis and the gradients do not get registered.

I’ve done some tests comparing the tf.signal GPU + gradient-registered methods vs. C++. I get different results using the example here (but using spectrograms = tf.square(tf.abs(stfts)) with the C++ parameters defined in the file linked above) vs. the C++ implementation.
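For reference, the spectrograms = tf.square(tf.abs(stfts)) step is just the magnitude-squared FFT of each frame. A minimal NumPy sketch of that step (no window function is applied here, which the real ops do, so treat this as a simplified approximation):

```python
import numpy as np

def power_spectrogram(frames):
    # Magnitude-squared of the real FFT of each frame, mirroring
    # spectrograms = tf.square(tf.abs(stfts)) from the tf.signal example.
    stft = np.fft.rfft(frames, axis=-1)
    return np.abs(stft) ** 2

frames = np.ones((2, 8), dtype=np.float32)  # two all-ones frames of 8 samples
ps = power_spectrogram(frames)
# ps.shape == (2, 5); the DC bin is |sum of samples|^2 = 64, other bins ~0
```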

Example output tensor for mfcc[0][10][:26], where A = C++ and B = tf.signal, against sample-000000.wav:

A tf.Tensor(
[-9.541672    0.7038873  -0.01703602  2.4102588   3.699989    2.9941733
  0.05728421 -0.39552015  0.41694957  0.1773318  -0.94534826 -0.07476918
 -0.72920483  0.04988448  0.03030526  0.70554906 -1.4502583   0.25273177
  0.06876066 -0.79810566  0.11811491 -0.5782758  -0.34861287  0.8405349
  0.19752231 -0.07848166], shape=(26,), dtype=float32)
B tf.Tensor(
[-16.386528     3.151216    -0.13298927   4.371477     6.127008
   5.412611     0.2834479   -0.65268487   0.40095583   0.2454046
  -0.5782857   -0.423165    -0.8718532   -0.24084271   0.44111174
   1.492601    -1.9069418    0.2948684   -0.45236063  -0.5574946
   0.21208026  -0.6371893   -0.29641697   0.44038427   0.4952675
  -0.30927685], shape=(26,), dtype=float32)
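A quick sanity check on the first few coefficients above (values copied from tensors A and B) confirms the two MFCC paths disagree by far more than floating-point noise:

```python
# First five MFCC coefficients from tensor A (C++) and tensor B (tf.signal)
a = [-9.541672, 0.7038873, -0.01703602, 2.4102588, 3.699989]
b = [-16.386528, 3.151216, -0.13298927, 4.371477, 6.127008]

max_abs_diff = max(abs(x - y) for x, y in zip(a, b))
# max_abs_diff ≈ 6.84 – a genuinely different computation, not rounding error
```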

Example output tensor for spectrogram[0][10][:26], where A = C++ and B = tf.signal, against sample-000000.wav. Note the minor differences -> CPU vs. GPU floating point. Any implementation using tf.signal would likely be an estimation of the way DeepSpeech does it. Whether that's close enough, I'm not sure; I would need to run some tests / have a think…

A tf.Tensor(
[2.5844438e-02 1.4196590e+01 7.8703909e+00 1.4777693e+00 9.0667505e+00
 1.7904934e+00 4.6659380e-02 1.6692156e-01 2.9524408e-02 1.2653788e-02
 1.4637280e-02 5.9227698e-04 1.2959058e-03 8.3363376e-04 2.5570781e-03
 4.5909737e-03 2.1686291e-03 5.4116650e-03 8.5209711e-03 1.0986484e-03
 9.5019070e-03 9.1623375e-03 5.7589263e-03 2.1361437e-02 3.0068232e-02
 1.6542105e-02], shape=(26,), dtype=float32)
B tf.Tensor(
[2.58445051e-02 1.41965885e+01 7.87038898e+00 1.47776914e+00
 9.06674957e+00 1.79049325e+00 4.66593951e-02 1.66921541e-01
 2.95244064e-02 1.26537923e-02 1.46372588e-02 5.92276279e-04
 1.29591115e-03 8.33630445e-04 2.55708303e-03 4.59098117e-03
 2.16862885e-03 5.41166542e-03 8.52095429e-03 1.09864434e-03
 9.50190984e-03 9.16234031e-03 5.75892068e-03 2.13614274e-02
 3.00682262e-02 1.65421031e-02], shape=(26,), dtype=float32)
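By contrast, the same check on the first few spectrogram bins above (values copied from tensors A and B) shows the differences really are at float32 rounding scale:

```python
# First four spectrogram bins from tensor A (C++) and tensor B (tf.signal)
a = [2.5844438e-02, 1.4196590e+01, 7.8703909e+00, 1.4777693e+00]
b = [2.58445051e-02, 1.41965885e+01, 7.87038898e+00, 1.47776914e+00]

max_rel_diff = max(abs(x - y) / abs(x) for x, y in zip(a, b))
# max_rel_diff is on the order of 1e-6 – consistent with float32 rounding
```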

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6

Top GitHub Comments

cli0 commented, Dec 9, 2019

Have you had any success in rewriting the code for DS 0.5.x?

dijksterhuis commented, May 26, 2021


