Recipe to use the freshly released streaming models (Augmented-memory and Emformer) for ASR?
- fairseq version: master
- PyTorch version: 1.7
- OS (e.g., Linux): Linux
- How you installed fairseq: git clone
- Python version: 3.8.5
- CUDA/cuDNN version: 10.2
- GPU models and configuration: Tesla V100 on a computing server
Hi,
I’d like to know whether it is possible to use the freshly released streaming encoders (augmented-memory transformer and Emformer) for streaming ASR purposes, e.g. training on LibriSpeech. For now, I only see them in the simultaneous translation folder.
I tried to follow the usual LibriSpeech ASR recipe (https://github.com/pytorch/fairseq/blob/master/examples/speech_to_text/docs/librispeech_example.md) mixed with the SimulST recipe (https://github.com/pytorch/fairseq/blob/master/examples/speech_to_text/docs/simulst_mustc_example.md), first pretraining an ASR model:

fairseq-train ${LS_ROOT} --save-dir ${SAVE_DIR} --config-yaml config.yaml \
  --train-subset train --valid-subset dev --num-workers 32 --max-tokens 40000 \
  --max-update 100000 --task speech_to_text --criterion label_smoothed_cross_entropy \
  --report-accuracy --arch convtransformer_espnet --share-decoder-input-output-embed \
  --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
  --clip-norm 10.0 --seed 1 --update-freq 8 --fp16
and then running:
fairseq-train ${LS_ROOT} --save-dir ${SAVE_DIR} --config-yaml config.yaml \
  --train-subset train --valid-subset dev --num-workers 32 --max-tokens 40000 \
  --max-update 300000 --task speech_to_text --criterion label_smoothed_cross_entropy \
  --report-accuracy --arch convtransformer_augmented_memory --share-decoder-input-output-embed \
  --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
  --clip-norm 10.0 --seed 1 --update-freq 8 \
  --simul-type infinite_lookback_fixed_pre_decision --fixed-pre-decision-ratio 7 \
  --segment-size 40 --fp16
for my final model training, but I got an exception:
Exception: Cannot load model parameters from checkpoint /path/to/checkpoint_last.pt; please ensure that the architectures match
Hence, I don’t know if this is the proper way to do it.
Also, I got the same error when trying to train another SimulST architecture, such as --arch convtransformer_simul_trans_espnet.
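My best guess is that fairseq-train found the ASR checkpoint_last.pt in ${SAVE_DIR} and tried to resume it with a mismatched --arch. The SimulST recipe linked above seems to avoid this by training into a fresh directory and importing only the pretrained encoder via --load-pretrained-encoder-from; roughly something like this (${ASR_SAVE_DIR} and ${SIMUL_SAVE_DIR} are just placeholder names, not from my commands above):

# Sketch under my assumptions: ${SIMUL_SAVE_DIR} is a new, empty directory,
# and the ASR checkpoint is loaded encoder-only instead of being resumed.
fairseq-train ${LS_ROOT} --save-dir ${SIMUL_SAVE_DIR} --config-yaml config.yaml \
  --train-subset train --valid-subset dev --num-workers 32 --max-tokens 40000 \
  --max-update 300000 --task speech_to_text --criterion label_smoothed_cross_entropy \
  --report-accuracy --arch convtransformer_augmented_memory --share-decoder-input-output-embed \
  --load-pretrained-encoder-from ${ASR_SAVE_DIR}/checkpoint_best.pt \
  --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
  --clip-norm 10.0 --seed 1 --update-freq 8 \
  --simul-type infinite_lookback_fixed_pre_decision --fixed-pre-decision-ratio 7 \
  --segment-size 40 --fp16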
But still, apart from the mismatch between models, is this workaround OK for training a streaming model on an ASR task?
Thanks in advance for your answer.
Top GitHub Comments
Hi, there’s actually a working implementation of Emformer in torchaudio, as suggested by @SatenHarutyunyan (thanks a ton, btw).
It has since moved into torchaudio’s main codebase, which I found here:
- code: https://github.com/pytorch/audio/blob/main/torchaudio/models/emformer.py
- docs: https://pytorch.org/audio/main/models.html#emformer
I’ve tested it and it worked like a charm.
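For anyone landing here later, a minimal usage sketch based on the example in those docs (all sizes are illustrative and follow the docs’ example):

import torch
from torchaudio.models import Emformer

# 512-dim input, 8 heads, 2048 FFN dim, 20 layers, segment length 4,
# plus 1 frame of right context (values taken from the torchaudio docs example).
emformer = Emformer(512, 8, 2048, 20, 4, right_context_length=1)

# Full-utterance forward pass (training-style).
x = torch.rand(128, 400, 512)            # (batch, frames incl. right context, dim)
lengths = torch.randint(1, 200, (128,))  # valid frames per utterance
output, output_lengths = emformer(x, lengths)

# Streaming inference: feed segment_length + right_context_length frames
# per step and carry the returned states into the next call.
chunk = torch.rand(128, 5, 512)
chunk_lengths = torch.full((128,), 5)
out, out_lengths, states = emformer.infer(chunk, chunk_lengths, None)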
Following. I also want to use the “convtransformer_augmented_memory” arch for a speech translation task, but so far I have no idea how to do it.