RNN-T Decoding - Large Number of Deletions Compared to Transformer/Conformer
I am getting a lot of deletions in my RNN-T training/decoding setup relative to Transformer/Conformer. The data is the MALACH corpus: about 200 hours of accented English speech from Holocaust survivors. I'd appreciate any insights/suggestions anyone may have!
These are the sclite results (all scores are percentages) from the three systems:
Condition | # Snt | # Wrd | Corr | Sub | Del | Ins | Err | S.Err |
---|---|---|---|---|---|---|---|---|
Transformer | 1155 | 12256 | 82.0 | 13.6 | 4.3 | 4.8 | 22.8 | 68.1 |
RNN-T | 1155 | 12256 | 70.4 | 13.5 | 16.1 | 2.7 | 32.3 | 73.9 |
Conformer | 1155 | 12256 | 81.8 | 12.9 | 5.3 | 4.1 | 22.3 | 68.4 |
The training and decoding configs for the RNN-T are included below.
This is the training config:
```yaml
# The conformer transducer training configuration from @jeon30c
# WERs for test-clean/test-other are 2.9 and 7.2, respectively.
# Trained with Tesla V100-SXM2(32GB) x 8 GPUs. It takes about 1.5 days.
batch_type: numel
batch_bins: 20000000
accum_grad: 2
max_epoch: 100
patience: none
init: none
best_model_criterion:
-   - valid
    - loss
    - min
keep_nbest_models: 10

model_conf:
    ctc_weight: 0.0
    report_cer: False
    report_wer: False

encoder: conformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d
    normalize_before: true
    macaron_style: true
    pos_enc_layer_type: "rel_pos"
    selfattention_layer_type: "rel_selfattn"
    activation_type: "swish"
    use_cnn_module: true
    cnn_module_kernel: 31

decoder: transducer
decoder_conf:
    rnn_type: lstm
    num_layers: 1
    hidden_size: 512
    dropout: 0.1
    dropout_embed: 0.1

joint_net_conf:
    joint_space_size: 640

optim: adam
optim_conf:
    lr: 0.0015
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000

frontend_conf:
    n_fft: 512
    hop_length: 160

specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 30
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range:
    - 0
    - 40
    num_time_mask: 2
```
This is the decoding config:
```yaml
# The conformer transducer decoding configuration from @jeon30c
beam_size: 10
transducer_conf:
    search_type: default
    score_norm: True
```
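For context, `search_type: default` selects the basic beam search in ESPnet's `BeamSearchTransducer`; the same `transducer_conf` block should accept the alternative strategies mentioned in the comments below (e.g. ALSD, mAES). A minimal sketch of switching to ALSD, where the `u_max` value is purely illustrative and not a recommendation:

```yaml
# Hypothetical ALSD variant of the decode config above.
# Parameter names follow ESPnet's BeamSearchTransducer; u_max is a placeholder to tune.
beam_size: 10
transducer_conf:
    search_type: alsd   # alignment-length synchronous decoding
    u_max: 300          # upper bound on output length considered by the search
    score_norm: True
```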
Top GitHub Comments
Thanks. Here is a good reference on MALACH https://www.isca-speech.org/archive_v0/Interspeech_2019/pdfs/1907.pdf
Sure, you can start with these ones:
And try increasing/decreasing `u_max` (for ALSD) and `nstep` (for mAES); they control expansion along either the label axis or the time axis. You can set `lm_weight: x.x` in your decode config. Not sure what the default value is in this version.
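As a concrete starting point for the suggestions above, a decode config along these lines could be tried; the values are placeholders to tune, and `lm_weight` only has an effect if an external LM was trained and is loaded at inference time:

```yaml
# Sketch only: nstep and lm_weight values are placeholders, not recommendations.
beam_size: 10
lm_weight: 0.3            # requires a trained external LM
transducer_conf:
    search_type: maes     # modified adaptive expansion search
    nstep: 3              # expansion steps per encoder frame
    score_norm: True
```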