Misalignment between documentation and implementation of mBART50 tokenisation for the decoder
System Info
- transformers version: 4.23.1
- Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.14
- Huggingface_hub version: 0.10.1
- PyTorch version (GPU?): 1.12.1+cu113 (False)
- Tensorflow version (GPU?): 2.8.2 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
The bug is reproduced in the outputs of this Colab notebook. Steps to reproduce:
- Make a copy of the notebook.
- Execute the first 2 cells.
- In the mBART source file (`/usr/local/bin/python3.7/dist-packages/transformers/models/mbart/modeling_mbart.py`), on line 1352 (above `outputs = self.model(...`, after the `if labels is not None` block), add `print(f'Decoder Input Ids: {decoder_input_ids}\nLabels: {labels}')`.
- Restart the runtime so the changes to the library take effect.
- Run the third cell. The output is:
    Decoder Input Ids: tensor([[2, 250020, 47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577]])
    Labels: tensor([[250020, 47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577, 2]])
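For convenience, the same behaviour can also be reproduced without patching the installed library, by calling `shift_tokens_right` directly on the tokenizer's output (a minimal sketch; it assumes the English→Romanian example from the docs, so the exact ids may vary):

```python
from transformers import MBart50TokenizerFast
from transformers.models.mbart.modeling_mbart import shift_tokens_right

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO"
)

# Target sentence from the docs example (assumed here for illustration).
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"
labels = tokenizer(text_target=tgt_text, return_tensors="pt").input_ids

# MBartForConditionalGeneration derives decoder_input_ids from labels via
# shift_tokens_right when they are not passed explicitly.
decoder_input_ids = shift_tokens_right(labels, tokenizer.pad_token_id)
print(f"Decoder Input Ids: {decoder_input_ids}\nLabels: {labels}")
# decoder_input_ids come out as [eos] [lang_id] tokens, not [lang_id] tokens [eos].
```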
Expected behavior
I was looking into fine-tuning `facebook/mbart-large-50` through this example in the documentation. As per the description, the expected input for the model is of the form `[lang_id] tokens [eos]` for both the encoder and the decoder.
While `MBart50Tokenizer` produces outputs in the expected format, the `decoder_input_ids` get transformed to an incorrect one, `[eos] [lang_id] tokens`. Specifically, I believe the output should have been the following (do correct me if I am wrong here):
    Decoder Input Ids: tensor([[250020, 47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577, 2]])
    Labels: tensor([[47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577, 2, 250020]])
This is caused by the `shift_tokens_right` function, which does not seem to be adapted for mBART-50. As per the docstring of this function, it will "wrap the last non pad token (the [LID] token)"; however, for mBART-50, the last non-pad token is an `eos`.
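For reference, a paraphrased sketch of what the mBART `shift_tokens_right` does (see `modeling_mbart.py` for the authoritative version): it rotates the last non-pad token of each row to position 0, which is exactly why the `eos` of an mBART-50 label sequence ends up at the front:

```python
import torch

def shift_tokens_right_sketch(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Paraphrased sketch: rotate the last non-pad token of each row to
    position 0 and shift everything else one step to the right."""
    prev_output_tokens = input_ids.clone()
    # Replace possible -100 values in labels with the pad token id.
    prev_output_tokens.masked_fill_(prev_output_tokens == -100, pad_token_id)
    # Index of the last non-pad token per row (an eos for mBART-50 labels).
    index_of_eos = (prev_output_tokens.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    decoder_start_tokens = prev_output_tokens.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = prev_output_tokens[:, :-1].clone()
    prev_output_tokens[:, 0] = decoder_start_tokens
    return prev_output_tokens

labels = torch.tensor([[250020, 47711, 7844, 2]])  # [lang_id] tokens [eos]
print(shift_tokens_right_sketch(labels, pad_token_id=1))
# tensor([[     2, 250020,  47711,   7844]])  -> [eos] [lang_id] tokens
```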
Additional question: why should the `[eos]` token predict the `[lang_id]`? This happens in both mBART and mBART-50. If it should not, should the last token in the labels be `-100`? If yes, there would be a subsequent issue, since the labels matrix from the tokenizer seems to use `1` as the padding token instead of `-100`. Do let me know if I should open a separate issue for this!
If this bug seems legitimate, I would be glad to provide a fix! I believe the `labels` key from `MBart50Tokenizer` would have to be updated to give the same output as `MBartTokenizer`.
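To illustrate the suggestion (a hypothetical sketch, not a tested fix): if `MBart50Tokenizer` emitted labels in the `MBartTokenizer` layout, tokens `[eos] [lang_id]`, the existing rotation in `shift_tokens_right` would already yield `[lang_id] tokens [eos]` as the decoder inputs:

```python
# Hypothetical labels in the MBartTokenizer (mBART-25) layout: tokens [eos] [lang_id].
labels_mbart25_style = torch.tensor([[47711, 7844, 2, 250020]])
print(shift_tokens_right_sketch(labels_mbart25_style, pad_token_id=1))
# tensor([[250020,  47711,   7844,      2]])  -> [lang_id] tokens [eos]
```

(This reuses `shift_tokens_right_sketch` from the snippet above.)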
Top GitHub Comments
- Not stale, still looking forward to a response!
- @ArthurZucker, when you have bandwidth, would you like to take a look at this?