Too low BLEU score in reproducing simultaneous speech translation (MuST-C en-de)
I tried to train simultaneous speech translation following simul_mustc_example.md. I trained SimulST at commit bc3bd55ec98c39af45ff7323ae49bcbdf93acc36 (because at the main-branch commit 1ef3d6a1a2cb7fa9937233c8bf796957871bfc94 an "arch not found" error occurred; preprocessing and ASR pretraining were done on main).
However, the BLEU score reported by SimulEval was terribly low (the documentation reports about 13 BLEU):
{
  "Quality": {
    "BLEU": 0.010281760678875936
  },
  "Latency": {
    "AL": 1299.5488329014138,
    "AL_CA": 1662.8518504975802,
    "AP": 0.45365599087862923,
    "AP_CA": 0.5802261617796486,
    "DAL": 1460.296257343413,
    "DAL_CA": 1764.66547917616
  }
}
Also, most of the predictions in instance.log were like "(Applaus)" or "(Musik)":
{"index": 0, "prediction": "(Applaus) </s>", "delays": [1680.0, 1680.0], "elapsed": [2029.341630935669, 2033.8196086883545], "prediction_length": 2, "reference": "Diese Durchbr\u00fcche m\u00fcssen wir mit Vollgas verfolgen und das k\u00f6nnen wir messen: in Firmenzahlen, in Pilotprojekten und Regulierungs\u00e4nderungen.", "source": ["/*/dev/ted_767_0.wav", "samplerate: 16000 Hz", "channels: 1", "duration: 8.600 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 8600.0, "reference_length": 18, "metric": {"sentence_bleu": 0.44439199324869233, "latency": {"AL": 1453.6842041015625, "AP": 0.1953488439321518, "DAL": 1680.0}, "latency_ca": {"AL": 1805.264892578125, "AP": 0.2362302988767624, "DAL": 2029.341796875}}}
{"index": 1, "prediction": "(Applaus) </s>", "delays": [1680.0, 1680.0], "elapsed": [2007.0959854125977, 2011.3920497894287], "prediction_length": 2, "reference": "Es gibt viele gro\u00dfartige B\u00fccher zu diesem Thema.", "source": ["/*/dev/ted_767_1.wav", "samplerate: 16000 Hz", "channels: 1", "duration: 2.530 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 2530.0, "reference_length": 8, "metric": {"sentence_bleu": 2.4675789207681893, "latency": {"AL": 1539.4444580078125, "AP": 0.6640316247940063, "DAL": 1680.0}, "latency_ca": {"AL": 1868.6884765625, "AP": 0.7941675782203674, "DAL": 2007.095947265625}}}
If there are any solutions, please let me know. Thank you.
Code
Pre-trained ASR: checkpoint_best.pt, trained with this command:
fairseq-train ${OUT_ROOT} \
--config-yaml config_asr.yaml --train-subset train_asr --valid-subset dev_asr \
--save-dir ${TMP} --num-workers 1 --max-tokens 20000 --max-update 100000 \
--task speech_to_text --criterion label_smoothed_cross_entropy --report-accuracy \
--arch convtransformer_espnet --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 16 --patience 4
Simultaneous speech translation
fairseq-train ${OUT_ROOT} \
--config-yaml config_st.yaml --train-subset train_st --valid-subset dev_st \
--save-dir ${TMP} --num-workers 1 \
--optimizer adam --lr 0.0001 --lr-scheduler inverse_sqrt --clip-norm 10.0 \
--criterion label_smoothed_cross_entropy \
--warmup-updates 4000 --max-update 100000 --max-tokens 20000 --seed 1 \
--load-pretrained-encoder-from ${ASR_SAVE_DIR}/checkpoint_best.pt \
--task simul_speech_to_text \
--arch convtransformer_simul_trans_espnet \
--simul-type waitk_fixed_pre_decision \
--waitk-lagging 3 \
--fixed-pre-decision-ratio 7 \
--update-freq 16 \
--patience 4
What’s your environment?
- OS: Ubuntu 18.04
- Python version: 3.7.4
- PyTorch version: 1.8.1
[Figure: dev_st loss curve]
Comments
After debugging the training code, I found the reason. Consider an example where encoder_state has 14 frames, text_sequence has 4 tokens, and batch_size = 1. In wait-k mode (k = 3), p_choose (in p_choose_strategy.py) is formulated as:

[[[0 0 1 0 0 0 0 0 0 0 0 0 0 0]
  [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
  [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
  [0 0 0 0 0 1 0 0 0 0 0 0 0 0]]]

alpha is the same as p_choose, and beta is something like:

[[[0.3 0.3 0.4 0   0   0   0 0 0 0 0 0 0 0]
  [0.2 0.3 0.3 0.2 0   0   0 0 0 0 0 0 0 0]
  [0.2 0.2 0.1 0.2 0.3 0   0 0 0 0 0 0 0 0]
  [0.1 0.2 0.2 0.1 0.2 0.2 0 0 0 0 0 0 0 0]]]

As we can see, only the context in frames [0, 7), i.e. [0, k + text_seq_len), enters the weighted sum, while the total context length equals the length of encoder_state.
So in this implementation too much context near the tail is ignored, which may be the reason for the poor performance (see the sketch below).
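To make the schedule concrete, here is a minimal sketch of the hard wait-k p_choose matrix described above; the function name and tensor layout are illustrative, not the actual p_choose_strategy.py code:

import torch

def waitk_p_choose(tgt_len, src_len, k):
    # Hard wait-k schedule: target token i puts all of its attention
    # mass on source frame i + k - 1 (0-indexed), capped at the last frame.
    p_choose = torch.zeros(1, tgt_len, src_len)  # (batch, target, source)
    for i in range(tgt_len):
        p_choose[0, i, min(i + k - 1, src_len - 1)] = 1.0
    return p_choose

print(waitk_p_choose(tgt_len=4, src_len=14, k=3))
# Reproduces the matrix above: the last row's 1 sits at column
# tgt_len + k - 2 = 5, so frames 6..13 never receive attention mass.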
Even worse, following simul_mustc_example.md for preprocessing and training on the full MuST-C en-de dataset (69 GB), I trained the ASR model for about 120 hours on 8 GPUs for 900 epochs (I shut it down because it was too slow) and the ST model for about 70 epochs (I also stopped it early because it was too slow). For evaluation I used seg_mustc_data.py to split the dataset and evaluated on 100 of the resulting sentences (SimulEval fails with "Connection refused" when the test set is large). The result is very poor…
Does anyone have any suggestions? Thank you!