
Recipe to use freshly released streaming models (Augmented-memory and Emformer) on ASR?

See original GitHub issue
  • fairseq Version: master
  • PyTorch Version: 1.7
  • OS (e.g., Linux): Linux
  • How you installed fairseq: git clone
  • Python version: 3.8.5
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: Tesla V100 on a computing server

Hi,

I’d like to know whether it is possible to use the freshly released streaming encoders (augmented-memory transformer and Emformer) for streaming ASR purposes, e.g. training on LibriSpeech. For now, I only see them in the simultaneous translation folder.

I tried to follow the usual LibriSpeech ASR recipe (https://github.com/pytorch/fairseq/blob/master/examples/speech_to_text/docs/librispeech_example.md) mixed with the SimulST recipe (https://github.com/pytorch/fairseq/blob/master/examples/speech_to_text/docs/simulst_mustc_example.md) by first pretraining an ASR model:

fairseq-train ${LS_ROOT} --save-dir ${SAVE_DIR} --config-yaml config.yaml --train-subset train --valid-subset dev --num-workers 32 --max-tokens 40000 --max-update 100000 --task speech_to_text --criterion label_smoothed_cross_entropy --report-accuracy --arch convtransformer_espnet --share-decoder-input-output-embed --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 --fp16

and then running:

fairseq-train ${LS_ROOT} --save-dir ${SAVE_DIR} --config-yaml config.yaml --train-subset train --valid-subset dev --num-workers 32 --max-tokens 40000 --max-update 300000 --task speech_to_text --criterion label_smoothed_cross_entropy --report-accuracy --arch convtransformer_augmented_memory --share-decoder-input-output-embed --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 --simul-type infinite_lookback_fixed_pre_decision --fixed-pre-decision-ratio 7 --segment-size 40 --fp16

for my final model training, but I got an exception:

Exception: Cannot load model parameters from checkpoint /path/to/checkpoint_last.pt; please ensure that the architectures match

Hence, I don’t know if this is the proper way to do it. I also got the same error when trying to train another SimulST architecture, such as --arch convtransformer_simul_trans_espnet.

Still, apart from the architecture mismatch, is this workaround a reasonable way to train a streaming model on an ASR task?
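For what it's worth, the mismatch error typically comes from resuming one architecture directly from another architecture's full checkpoint. The SimulST recipe linked above instead initializes only the encoder from the pretrained ASR model via `--load-pretrained-encoder-from`. A sketch of how the second command might look with that flag (paths and the `${ASR_SAVE_DIR}` variable are placeholders; remaining flags as in the original command):

```shell
# Sketch, not a verified recipe: initialize only the encoder from the
# pretrained ASR checkpoint rather than resuming the whole model.
fairseq-train ${LS_ROOT} \
  --save-dir ${SAVE_DIR} --config-yaml config.yaml \
  --train-subset train --valid-subset dev \
  --task speech_to_text --criterion label_smoothed_cross_entropy \
  --arch convtransformer_augmented_memory \
  --simul-type infinite_lookback_fixed_pre_decision \
  --fixed-pre-decision-ratio 7 --segment-size 40 \
  --load-pretrained-encoder-from ${ASR_SAVE_DIR}/checkpoint_best.pt \
  --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 --fp16
```

Note this only sidesteps the mismatch for the encoder; the decoder would still train from scratch.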

Thanks in advance for your answer.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 5
Top GitHub Comments

1 reaction
George0828Zhang commented, Feb 27, 2022

Hi, there’s actually a working implementation of Emformer in torchaudio, as suggested by @SatenHarutyunyan (thanks a ton, btw):

https://github.com/pytorch/audio/blob/48cfbf2ba8ca4521e181d7c6b7b424829b6dcba4/test/torchaudio_unittest/prototype/emformer_test_impl.py

It has since moved into torchaudio’s main codebase:
code: https://github.com/pytorch/audio/blob/main/torchaudio/models/emformer.py
docs: https://pytorch.org/audio/main/models.html#emformer

I’ve tested it and it worked like a charm.

0 reactions
duj12 commented, Feb 18, 2022

Following. I also want to use the convtransformer_augmented_memory arch for a speech translation task, but so far have no idea how.
