Recipe to use the freshly released streaming models (Augmented-memory and Emformer) for ASR?
- fairseq version: master
- PyTorch version: 1.7
- OS (e.g., Linux): Linux
- How you installed fairseq: git clone
- Python version: 3.8.5
- CUDA/cuDNN version: 10.2
- GPU models and configuration: Tesla V100 on a computing server
Hi,
I’d like to know whether it is possible to use the freshly released streaming encoders (augmented-memory transformer and Emformer) for streaming ASR purposes, e.g. training on LibriSpeech. For now, I only see them in the simultaneous translation folder.
I tried to follow the usual LibriSpeech ASR recipe (https://github.com/pytorch/fairseq/blob/master/examples/speech_to_text/docs/librispeech_example.md) mixed with the SimulST recipe (https://github.com/pytorch/fairseq/blob/master/examples/speech_to_text/docs/simulst_mustc_example.md), first pretraining an ASR model:

fairseq-train ${LS_ROOT} --save-dir ${SAVE_DIR} --config-yaml config.yaml \
  --train-subset train --valid-subset dev --num-workers 32 --max-tokens 40000 \
  --max-update 100000 --task speech_to_text --criterion label_smoothed_cross_entropy \
  --report-accuracy --arch convtransformer_espnet --share-decoder-input-output-embed \
  --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
  --clip-norm 10.0 --seed 1 --update-freq 8 --fp16
and then running:
fairseq-train ${LS_ROOT} --save-dir ${SAVE_DIR} --config-yaml config.yaml \
  --train-subset train --valid-subset dev --num-workers 32 --max-tokens 40000 \
  --max-update 300000 --task speech_to_text --criterion label_smoothed_cross_entropy \
  --report-accuracy --arch convtransformer_augmented_memory --share-decoder-input-output-embed \
  --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
  --clip-norm 10.0 --seed 1 --update-freq 8 \
  --simul-type infinite_lookback_fixed_pre_decision --fixed-pre-decision-ratio 7 \
  --segment-size 40 --fp16
for my final model training, but I got an exception:
Exception: Cannot load model parameters from checkpoint /path/to/checkpoint_last.pt; please ensure that the architectures match
Hence, I don’t know if this is the proper way to do it.
Also, I got the same error when trying to train another SimulST architecture, such as --arch convtransformer_simul_trans_espnet.
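My best guess is that fairseq-train found the ASR checkpoint_last.pt in ${SAVE_DIR} and tried to resume it with a mismatched --arch. The SimulST recipe linked above seems to avoid this by training into a fresh directory and importing only the pretrained encoder via --load-pretrained-encoder-from; roughly something like this (${ASR_SAVE_DIR} and ${SIMUL_SAVE_DIR} are just placeholder names, not from my commands above):

# Sketch under my assumptions: ${SIMUL_SAVE_DIR} is a new, empty directory,
# and the ASR checkpoint is loaded encoder-only instead of being resumed.
fairseq-train ${LS_ROOT} --save-dir ${SIMUL_SAVE_DIR} --config-yaml config.yaml \
  --train-subset train --valid-subset dev --num-workers 32 --max-tokens 40000 \
  --max-update 300000 --task speech_to_text --criterion label_smoothed_cross_entropy \
  --report-accuracy --arch convtransformer_augmented_memory --share-decoder-input-output-embed \
  --load-pretrained-encoder-from ${ASR_SAVE_DIR}/checkpoint_best.pt \
  --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
  --clip-norm 10.0 --seed 1 --update-freq 8 \
  --simul-type infinite_lookback_fixed_pre_decision --fixed-pre-decision-ratio 7 \
  --segment-size 40 --fp16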
But still, apart from the mismatch between models, is this workaround OK for training a streaming model on an ASR task?
Thanks in advance for your answer.
Top GitHub Comments
Hi, there’s actually a working implementation of Emformer in torchaudio, as suggested by @SatenHarutyunyan (thanks a ton, btw).
It has since moved into torchaudio’s main codebase, which I found here:
- code: https://github.com/pytorch/audio/blob/main/torchaudio/models/emformer.py
- docs: https://pytorch.org/audio/main/models.html#emformer
I’ve tested it and it worked like a charm.
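For anyone landing here later, a minimal usage sketch based on the example in those docs (all sizes are illustrative and follow the docs’ example):

import torch
from torchaudio.models import Emformer

# 512-dim input, 8 heads, 2048 FFN dim, 20 layers, segment length 4,
# plus 1 frame of right context (values taken from the torchaudio docs example).
emformer = Emformer(512, 8, 2048, 20, 4, right_context_length=1)

# Full-utterance forward pass (training-style).
x = torch.rand(128, 400, 512)            # (batch, frames incl. right context, dim)
lengths = torch.randint(1, 200, (128,))  # valid frames per utterance
output, output_lengths = emformer(x, lengths)

# Streaming inference: feed segment_length + right_context_length frames
# per step and carry the returned states into the next call.
chunk = torch.rand(128, 5, 512)
chunk_lengths = torch.full((128,), 5)
out, out_lengths, states = emformer.infer(chunk, chunk_lengths, None)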
Following. I also want to use the “convtransformer_augmented_memory” arch for a speech translation task, but so far I have no idea how to do it.