
Too low BLEU score in reproducing simultaneous speech translation (MuST-C en-de)


I tried to train simultaneous speech translation following simul_mustc_example.md. I trained the SimulST model at commit bc3bd55ec98c39af45ff7323ae49bcbdf93acc36 (because on main, at commit 1ef3d6a1a2cb7fa9937233c8bf796957871bfc94, a "Not found arch" error occurred; preprocessing and ASR pretraining were done on main).

However, the BLEU score reported by SimulEval was terribly low (the documentation reports a BLEU of about 13):

{
    "Quality": {
        "BLEU": 0.010281760678875936
    },
    "Latency": {
        "AL": 1299.5488329014138,
        "AL_CA": 1662.8518504975802,
        "AP": 0.45365599087862923,
        "AP_CA": 0.5802261617796486,
        "DAL": 1460.296257343413,
        "DAL_CA": 1764.66547917616
    }
}

And most of the predictions in instance.log were like "(Applaus)" or "(Musik)":

{"index": 0, "prediction": "(Applaus) </s>", "delays": [1680.0, 1680.0], "elapsed": [2029.341630935669, 2033.8196086883545], "prediction_length": 2, "reference": "Diese Durchbr\u00fcche m\u00fcssen wir mit Vollgas verfolgen und das k\u00f6nnen wir messen: in Firmenzahlen, in Pilotprojekten und Regulierungs\u00e4nderungen.", "source": ["/*/dev/ted_767_0.wav", "samplerate: 16000 Hz", "channels: 1", "duration: 8.600 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 8600.0, "reference_length": 18, "metric": {"sentence_bleu": 0.44439199324869233, "latency": {"AL": 1453.6842041015625, "AP": 0.1953488439321518, "DAL": 1680.0}, "latency_ca": {"AL": 1805.264892578125, "AP": 0.2362302988767624, "DAL": 2029.341796875}}}
{"index": 1, "prediction": "(Applaus) </s>", "delays": [1680.0, 1680.0], "elapsed": [2007.0959854125977, 2011.3920497894287], "prediction_length": 2, "reference": "Es gibt viele gro\u00dfartige B\u00fccher zu diesem Thema.", "source": ["/*/dev/ted_767_1.wav", "samplerate: 16000 Hz", "channels: 1", "duration: 2.530 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 2530.0, "reference_length": 8, "metric": {"sentence_bleu": 2.4675789207681893, "latency": {"AL": 1539.4444580078125, "AP": 0.6640316247940063, "DAL": 1680.0}, "latency_ca": {"AL": 1868.6884765625, "AP": 0.7941675782203674, "DAL": 2007.095947265625}}}

If there are any solutions, please let me know. Thank you.

Code

Pre-trained ASR: checkpoint_best.pt, trained with this command:

fairseq-train ${OUT_ROOT} \
  --config-yaml config_asr.yaml --train-subset train_asr --valid-subset dev_asr \
  --save-dir ${TMP} --num-workers 1 --max-tokens 20000 --max-update 100000 \
  --task speech_to_text --criterion label_smoothed_cross_entropy --report-accuracy \
  --arch convtransformer_espnet --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 16 --patience 4

Simultaneous speech translation

fairseq-train ${OUT_ROOT} \
       --config-yaml config_st.yaml --train-subset train_st --valid-subset dev_st \
       --save-dir ${TMP} --num-workers 1 \
       --optimizer adam --lr 0.0001 --lr-scheduler inverse_sqrt --clip-norm 10.0 \
       --criterion label_smoothed_cross_entropy \
       --warmup-updates 4000 --max-update 100000 --max-tokens 20000 --seed 1 \
       --load-pretrained-encoder-from ${ASR_SAVE_DIR}/checkpoint_best.pt \
       --task simul_speech_to_text \
       --arch convtransformer_simul_trans_espnet \
       --simul-type waitk_fixed_pre_decision \
       --waitk-lagging 3 \
       --fixed-pre-decision-ratio 7 \
       --update-freq 16 \
       --patience 4
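
One thing worth ruling out at this step (a debugging sketch under assumed checkpoint paths, not part of the original recipe): verify that --load-pretrained-encoder-from actually copied the ASR encoder into the ST model, since a silent key mismatch would leave the encoder randomly initialized:

import torch

# Debugging sketch; both checkpoint paths are placeholders.
asr = torch.load("asr_save_dir/checkpoint_best.pt", map_location="cpu")["model"]
st = torch.load("st_save_dir/checkpoint_last.pt", map_location="cpu")["model"]

asr_enc = {k for k in asr if k.startswith("encoder.")}
st_enc = {k for k in st if k.startswith("encoder.")}
print("encoder keys only in ASR:", sorted(asr_enc - st_enc)[:5])
print("encoder keys only in ST: ", sorted(st_enc - asr_enc)[:5])

# Values drift once ST training starts, but key names and tensor shapes
# should line up; a large mismatch means the pretrained encoder never loaded.
for k in sorted(asr_enc & st_enc)[:5]:
    print(k, tuple(asr[k].shape), tuple(st[k].shape))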

What’s your environment?

  • OS: Ubuntu 18.04
  • Python version: 3.7.4
  • PyTorch version: 1.8.1

dev_st loss (training-loss curve image from the original issue, omitted here)


Top GitHub Comments

duj12 commented, Mar 17, 2022 (2 reactions)

After debugging the training code, I found the reason. Take an example where encoder_state has 14 frames, text_sequence has 4 tokens, and batch_size = 1. In wait-k mode (k = 3), p_choose (in p_choose_strategy.py) is formulated as:

[[[0 0 1 0 0 0 0 0 0 0 0 0 0 0]
  [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
  [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
  [0 0 0 0 0 1 0 0 0 0 0 0 0 0]]]

alpha is the same as p_choose, and beta is something like:

[[[0.3 0.3 0.4 0   0   0   0 0 0 0 0 0 0 0]
  [0.2 0.3 0.3 0.2 0   0   0 0 0 0 0 0 0 0]
  [0.2 0.2 0.1 0.2 0.3 0   0 0 0 0 0 0 0 0]
  [0.1 0.2 0.2 0.1 0.2 0.2 0 0 0 0 0 0 0 0]]]

We can see that only the frames in [0, k + text_seq_len), here [0, 7), enter the weighted sum, even though the total context length equals the full length of encoder_state.

So in this implementation, too much of the context near the tail is ignored, which may be the reason for the poor performance.
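
A minimal sketch of the hard alignment described above (a toy re-implementation of the wait-k rule for illustration, not the actual code in p_choose_strategy.py):

import torch

def waitk_p_choose(tgt_len: int, src_len: int, k: int) -> torch.Tensor:
    # Toy wait-k rule: target step i reads source frames up to i + k - 1,
    # so each row of p_choose is one-hot at column min(i + k - 1, src_len - 1).
    p_choose = torch.zeros(tgt_len, src_len)
    for i in range(tgt_len):
        p_choose[i, min(i + k - 1, src_len - 1)] = 1.0
    return p_choose

p = waitk_p_choose(tgt_len=4, src_len=14, k=3)
print(p)
# Columns 6..13 never receive probability mass: with 4 target tokens and
# k = 3 the last attended frame is index 5, so the final 8 of the 14
# encoder frames are ignored, matching the observation above.
print(bool((p[:, 6:] == 0).all()))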

EricLina commented, Mar 9, 2022 (0 reactions)

Even worse: I followed simul_mustc_example.md for preprocessing and training. On the full MuST-C en-de dataset (69 GB), I trained the ASR model for about 120 hours on 8 GPUs, reaching epoch 900 (I shut it down because it was too slow), and the ST model for about 70 epochs (I also stopped it early because it was too slow). For evaluation, I used seg_mustc_data.py to segment the data and evaluated on 100 of the resulting sentences (SimulEval gives "Connection refused" when the test set is large). The result is very poor …

Does anyone have some suggestions? Thank you!

2022-03-09 20:34:51 | INFO     | simuleval.cli    | Evaluation results:
{
    "Quality": {
        "BLEU": 0.2027780041409297
    },
    "Latency": {
        "AL": 1248.1308325195312,
        "AL_CA": 15497.652229003907,
        "AP": 0.3861502431333065,
        "AP_CA": 39.99524466373026,
        "DAL": 1411.803270072937,
        "DAL_CA": 19719.095799560546
    }
}

