
RNN-T Decoding - Large Number of Deletions Compared to Transformer/Conformer


I am getting a lot of deletions in my RNN-T training/decoding setup relative to Transformer/Conformer. The data is the “Malach” corpus: about 200 hours of accented English speech from Holocaust survivors. I’d appreciate any insights/suggestions anyone may have!

These are the sclite outputs from all three systems:

Condition     # Snt   # Wrd   Corr   Sub    Del   Ins   Err    S.Err
Transformer    1155   12256   82.0   13.6    4.3   4.8   22.8   68.1
RNN-T          1155   12256   70.4   13.5   16.1   2.7   32.3   73.9
Conformer      1155   12256   81.8   12.9    5.3   4.1   22.3   68.4
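
For reference, sclite’s Err column is just Sub + Del + Ins (13.5 + 16.1 + 2.7 = 32.3 for the RNN-T versus 12.9 + 5.3 + 4.1 = 22.3 for the Conformer), so the roughly ten-point gap in error rate comes almost entirely from the deletion rate.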

I also attached training and decoding configs for the RNN-T.

This is the training config for the RNN-T:

# The conformer transducer training configuration from @jeon30c
# WERs for test-clean/test-other are 2.9 and 7.2, respectively.
# Trained with Tesla V100-SXM2(32GB) x 8 GPUs. It takes about 1.5 days.
batch_type: numel
batch_bins: 20000000
accum_grad: 2
max_epoch: 100
patience: none
init: none
best_model_criterion:
-   - valid
    - loss
    - min
keep_nbest_models: 10

model_conf:
    ctc_weight: 0.0
    report_cer: False
    report_wer: False

encoder: conformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d
    normalize_before: true
    macaron_style: true
    pos_enc_layer_type: "rel_pos"
    selfattention_layer_type: "rel_selfattn"
    activation_type: "swish"
    use_cnn_module:  true
    cnn_module_kernel: 31

decoder: transducer
decoder_conf:
    rnn_type: lstm
    num_layers: 1
    hidden_size: 512
    dropout: 0.1
    dropout_embed: 0.1

joint_net_conf:
    joint_space_size: 640

optim: adam
optim_conf:
    lr: 0.0015
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000

frontend_conf:
  n_fft: 512
  hop_length: 160 

specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 30
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range:
    - 0
    - 40
    num_time_mask: 2

This is the decoding config:

# The conformer transducer decoding configuration from @jeon30c
beam_size: 10
transducer_conf:
    search_type: default
    score_norm: True

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 6

Top GitHub Comments

1 reaction
picheny-nyu commented, Jul 27, 2022

0 reactions
b-flo commented, Jul 27, 2022

> Happy to try other methods but I need some parameter recommendations.

Sure, you can start with these ones:

search_type: alsd # (or maes)
u_max: 250
nstep: 3
prefix_alpha: 2
expansion_gamma: 2
expansion_beta: 2.3

And try increasing/decreasing u_max (for ALSD) and nstep (for mAES); these control how far hypotheses are expanded along the label and time axes.
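
To make this concrete, here is a sketch of what a complete decoding config could look like for the mAES search, with the ALSD alternative shown in comments. The parameter split between the two search types follows the snippet above, beam_size is carried over from the original decoding config, and the option names assume the ESPnet2 transducer beam search:

# Sketch of an mAES decoding config (values taken from the suggestion above)
beam_size: 10
transducer_conf:
    search_type: maes
    nstep: 3             # expansion steps per time frame
    prefix_alpha: 2
    expansion_gamma: 2
    expansion_beta: 2.3
    score_norm: True

# For ALSD, the transducer_conf block would instead be:
# transducer_conf:
#     search_type: alsd
#     u_max: 250         # cap on expansion along the label axis
#     score_norm: True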

> Also, is there some way on the default beam search to change the language model weight?

You can set lm_weight: x.x in your decode config. Not sure what the default value is in this version.
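
For example, assuming an external LM is actually loaded at decoding time, the weight sits at the top level of the decode config next to beam_size; the 0.3 below is only a placeholder, not a recommended value:

# Sketch: default transducer beam search with an explicit LM weight
beam_size: 10
lm_weight: 0.3
transducer_conf:
    search_type: default
    score_norm: True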
