RNN-T Decoding - Large Number of Deletions Compared to Transformer/Conformer
I am getting a lot of deletions in my RNN-T training/decoding setup relative to Transformer/Conformer. The data is the MALACH corpus: about 200 hours of accented English speech from Holocaust survivors. I'd appreciate any insights/suggestions anyone may have!
These are the sclite results (all scores are percentages) from the three systems:
Condition | # Snt | # Wrd | Corr | Sub | Del | Ins | Err | S.Err |
---|---|---|---|---|---|---|---|---|
Transformer | 1155 | 12256 | 82.0 | 13.6 | 4.3 | 4.8 | 22.8 | 68.1 |
RNN-T | 1155 | 12256 | 70.4 | 13.5 | 16.1 | 2.7 | 32.3 | 73.9 |
Conformer | 1155 | 12256 | 81.8 | 12.9 | 5.3 | 4.1 | 22.3 | 68.4 |
The training and decoding configs for the RNN-T are included below.
This is the training config:
```yaml
# The conformer transducer training configuration from @jeon30c
# WERs for test-clean/test-other are 2.9 and 7.2, respectively.
# Trained with Tesla V100-SXM2(32GB) x 8 GPUs. It takes about 1.5 days.
batch_type: numel
batch_bins: 20000000
accum_grad: 2
max_epoch: 100
patience: none
init: none
best_model_criterion:
-   - valid
    - loss
    - min
keep_nbest_models: 10

model_conf:
    ctc_weight: 0.0
    report_cer: False
    report_wer: False

encoder: conformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d
    normalize_before: true
    macaron_style: true
    pos_enc_layer_type: "rel_pos"
    selfattention_layer_type: "rel_selfattn"
    activation_type: "swish"
    use_cnn_module: true
    cnn_module_kernel: 31

decoder: transducer
decoder_conf:
    rnn_type: lstm
    num_layers: 1
    hidden_size: 512
    dropout: 0.1
    dropout_embed: 0.1

joint_net_conf:
    joint_space_size: 640

optim: adam
optim_conf:
    lr: 0.0015
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000

frontend_conf:
    n_fft: 512
    hop_length: 160

specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 30
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range:
    - 0
    - 40
    num_time_mask: 2
```
This is the decoding config:
```yaml
# The conformer transducer decoding configuration from @jeon30c
beam_size: 10
transducer_conf:
    search_type: default
    score_norm: True
```
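For context, `search_type: default` selects the basic beam search in ESPnet's `BeamSearchTransducer`; the same `transducer_conf` block should accept the alternative strategies mentioned in the comments below (e.g. ALSD, mAES). A minimal sketch of switching to ALSD, where the `u_max` value is purely illustrative and not a recommendation:

```yaml
# Hypothetical ALSD variant of the decode config above.
# Parameter names follow ESPnet's BeamSearchTransducer; u_max is a placeholder to tune.
beam_size: 10
transducer_conf:
    search_type: alsd   # alignment-length synchronous decoding
    u_max: 300          # upper bound on output length considered by the search
    score_norm: True
```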
Top GitHub Comments
Thanks. Here is a good reference on MALACH https://www.isca-speech.org/archive_v0/Interspeech_2019/pdfs/1907.pdf
Sure, you can start with these ones:
And try increasing/decreasing `u_max` (for ALSD) and `nstep` (for mAES); they control expansion along either the label axis or the time axis. You can set `lm_weight: x.x` in your decode config. Not sure what the default value is in this version.
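As a concrete starting point for the suggestions above, a decode config along these lines could be tried; the values are placeholders to tune, and `lm_weight` only has an effect if an external LM was trained and is loaded at inference time:

```yaml
# Sketch only: nstep and lm_weight values are placeholders, not recommendations.
beam_size: 10
lm_weight: 0.3            # requires a trained external LM
transducer_conf:
    search_type: maes     # modified adaptive expansion search
    nstep: 3              # expansion steps per encoder frame
    score_norm: True
```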