Possible error in MBart Tokenization script -- target lang code is only present in seq once
Environment info
- `transformers` version: current
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.6.0+cu101 (False)
- Using GPU in script?: No.
- Using distributed or parallel set-up in script?: No.
Who can help
MBart: @sshleifer
Information
Model I am using is MBart.
The problem arises when using:
- [x] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)
To reproduce
Steps to reproduce the behavior:
```python
from transformers import MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained('facebook/mbart-large-en-ro')
example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
batch: dict = tokenizer.prepare_seq2seq_batch(
    example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian
)
```
```
-snip-
'labels': tensor([[ 47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577, 2, 250020]])}
```
The target language code is only present once in the target sequence.
```python
print(tokenizer.lang_code_to_id["ro_RO"])  # 250020
```
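One quick way to double-check this at the token level (not in the original report; it reuses `batch` and `tokenizer` from the snippet above):

```python
# Inspect the label ids as tokens; 'ro_RO' shows up only once, after '</s>'.
print(tokenizer.convert_ids_to_tokens(batch["labels"][0].tolist()))
# [..., '</s>', 'ro_RO']
```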
Expected behavior
```
'labels': tensor([[ 250020, 47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577, 2, 250020]])}
```
Here, the target language code appears both first and last, as I believe the MBart paper (https://arxiv.org/pdf/2001.08210.pdf, top of page 3) describes.
MBart Excerpt:
For each instance of a batch we sample a language id symbol <LID> ... Sentences in the instance are separated by the end of sentence (</S>) token. Then, we append the selected <LID> ...
Here is the code I believe is wrong:
```python
def set_tgt_lang_special_tokens(self, lang: str) -> None:
    """Reset the special tokens to the target language setting. Prefix [tgt_lang_code], suffix =[eos]."""
    self.cur_lang_code = self.lang_code_to_id[lang]
    self.prefix_tokens = []
    self.suffix_tokens = [self.eos_token_id, self.cur_lang_code]
```
To me, the comment implies the language code should be first as well. I tested it locally, and merely adding `self.cur_lang_code` to `self.prefix_tokens` resolves the issue.
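Concretely, the local change I tried amounts to this (a sketch against the method quoted above, not the shipped implementation):

```python
def set_tgt_lang_special_tokens(self, lang: str) -> None:
    """Reset the special tokens to the target language setting. Prefix [tgt_lang_code], suffix = [eos, tgt_lang_code]."""
    self.cur_lang_code = self.lang_code_to_id[lang]
    self.prefix_tokens = [self.cur_lang_code]                    # prepend the language code ...
    self.suffix_tokens = [self.eos_token_id, self.cur_lang_code]  # ... and keep it after eos as before
```

With this, the labels come out as `[250020, ..., 2, 250020]`, matching the expected behavior above.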
I do not know if I am misunderstanding the purpose of this script or misusing it. My code above is copied from the `MBartTokenizer` example at https://huggingface.co/transformers/master/model_doc/mbart.html#overview.
If I didn't make a mistake, I'd be more than happy to open a PR to change that one line and fix it.
Comments
You are missing the distinction between `decoder_input_ids` and `labels`, I think. For `mbart-large-en-ro` we have `decoder_start_token_id=250020` for this reason. Then in `finetune.py`, `shift_tokens_right` moves the language code to the 0th column of `decoder_input_ids`. You can also read this, which is related.
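For illustration, the shifting works roughly like this (a sketch of the idea as I understand it, not necessarily the exact code in the repo):

```python
import torch

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Wrap the last non-pad token (for MBart labels, the language code) around to
    position 0 and shift everything else one position to the right."""
    prev_output_tokens = input_ids.clone()
    index_of_last_token = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_last_token).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens

# Shortened labels from the example above: [..., eos, ro_RO]
labels = torch.tensor([[47711, 7844, 125577, 2, 250020]])
print(shift_tokens_right(labels, pad_token_id=1))
# -> [[250020, 47711, 7844, 125577, 2]]  (language code now in column 0)
```

So the model is trained with `decoder_input_ids` starting in `ro_RO`, while `labels` keep it at the end.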
I would definitely welcome a contribution to the docs that explained this clearly!
It's also worth noting that if there is no dedicated BOS token (as with MBart), there is no natural way to tell the decoder to start generating at inference time: the model has never predicted the first token of a sequence.
The example at https://huggingface.co/transformers/master/model_doc/mbart.html#overview prepends the language code during inference, but if the same is not done during training, this causes domain shift.
Unless the decoder (or something else) is editing the targets behind the scenes (beyond shifting the indexes by one during training), I believe the current method of preparing batches introduces domain shift.
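For reference, the inference side I am referring to seeds the decoder with the target language code via `decoder_start_token_id`, roughly like this (a sketch based on the linked overview page; the snippet there may differ in detail):

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")

batch = tokenizer.prepare_seq2seq_batch(
    ["UN Chief Says There Is No Military Solution in Syria"], src_lang="en_XX"
)
# The decoder is started from the target language code, i.e. the same token that
# shift_tokens_right puts in column 0 of decoder_input_ids during training.
translated_tokens = model.generate(
    **batch, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"]
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True))
```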