Misalignment between documentation and implementation of mBART50 tokenisation for the decoder
System Info
- transformers version: 4.23.1
- Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.14
- Huggingface_hub version: 0.10.1
- PyTorch version (GPU?): 1.12.1+cu113 (False)
- Tensorflow version (GPU?): 2.8.2 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
The bug is reproduced in the outputs of this Colab notebook. Steps to reproduce:
- Make a copy of the notebook.
- Execute the first 2 cells.
- In the mBART source file (`/usr/local/bin/python3.7/dist-packages/transformers/models/mbart/modeling_mbart.py`), on line 1352 (above `outputs = self.model(...`, after the `if labels is not None` block), add `print(f'Decoder Input Ids: {decoder_input_ids}\nLabels: {labels}')`.
- Restart the runtime so the changes to the library take effect.
- Run the third cell. The output is:
    Decoder Input Ids: tensor([[2, 250020, 47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577]])
    Labels: tensor([[250020, 47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577, 2]])
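For convenience, the same behaviour can also be reproduced without patching the installed library, by calling `shift_tokens_right` directly on the tokenizer's output (a minimal sketch; it assumes the English→Romanian example from the docs, so the exact ids may vary):

```python
from transformers import MBart50TokenizerFast
from transformers.models.mbart.modeling_mbart import shift_tokens_right

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO"
)

# Target sentence from the docs example (assumed here for illustration).
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"
labels = tokenizer(text_target=tgt_text, return_tensors="pt").input_ids

# MBartForConditionalGeneration derives decoder_input_ids from labels via
# shift_tokens_right when they are not passed explicitly.
decoder_input_ids = shift_tokens_right(labels, tokenizer.pad_token_id)
print(f"Decoder Input Ids: {decoder_input_ids}\nLabels: {labels}")
# decoder_input_ids come out as [eos] [lang_id] tokens, not [lang_id] tokens [eos].
```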
Expected behavior
I was looking into fine-tuning `facebook/mbart-large-50` through this example in the documentation. As per the description, the expected input for the model is of the form `[lang_id] tokens [eos]` for both the encoder and the decoder.
While `MBart50Tokenizer` produces outputs in the expected format, the `decoder_input_ids` get transformed to an incorrect one, `[eos] [lang_id] tokens`. Specifically, I believe the output should have been the following (do correct me if I am wrong here):
    Decoder Input Ids: tensor([[250020, 47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577, 2]])
    Labels: tensor([[47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577, 2, 250020]])
This is caused by the `shift_tokens_right` function, which does not seem to be adapted for mBART-50. As per the docstring of this function, it will "wrap the last non pad token (the [LID] token)"; however, for mBART-50, the last non-pad token is an `eos`.
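For reference, a paraphrased sketch of what the mBART `shift_tokens_right` does (see `modeling_mbart.py` for the authoritative version): it rotates the last non-pad token of each row to position 0, which is exactly why the `eos` of an mBART-50 label sequence ends up at the front:

```python
import torch

def shift_tokens_right_sketch(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Paraphrased sketch: rotate the last non-pad token of each row to
    position 0 and shift everything else one step to the right."""
    prev_output_tokens = input_ids.clone()
    # Replace possible -100 values in labels with the pad token id.
    prev_output_tokens.masked_fill_(prev_output_tokens == -100, pad_token_id)
    # Index of the last non-pad token per row (an eos for mBART-50 labels).
    index_of_eos = (prev_output_tokens.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    decoder_start_tokens = prev_output_tokens.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = prev_output_tokens[:, :-1].clone()
    prev_output_tokens[:, 0] = decoder_start_tokens
    return prev_output_tokens

labels = torch.tensor([[250020, 47711, 7844, 2]])  # [lang_id] tokens [eos]
print(shift_tokens_right_sketch(labels, pad_token_id=1))
# tensor([[     2, 250020,  47711,   7844]])  -> [eos] [lang_id] tokens
```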
Additional question: why should the `[eos]` token predict the `[lang_id]`? This happens in both mBART and mBART-50. If it should not, should the last token in the labels be `-100`? If yes, there would be a subsequent issue, since the labels matrix from the tokenizer seems to use `1` as the padding token instead of `-100`. Do let me know if I should open a separate issue for this!
If this bug seems legitimate, I would be glad to provide a fix! I believe the `labels` key from `MBart50Tokenizer` would have to be updated to give the same output as `MBartTokenizer`.
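To illustrate the suggestion (a hypothetical sketch, not a tested fix): if `MBart50Tokenizer` emitted labels in the `MBartTokenizer` layout, tokens `[eos] [lang_id]`, the existing rotation in `shift_tokens_right` would already yield `[lang_id] tokens [eos]` as the decoder inputs:

```python
# Hypothetical labels in the MBartTokenizer (mBART-25) layout: tokens [eos] [lang_id].
labels_mbart25_style = torch.tensor([[47711, 7844, 2, 250020]])
print(shift_tokens_right_sketch(labels_mbart25_style, pad_token_id=1))
# tensor([[250020,  47711,   7844,      2]])  -> [lang_id] tokens [eos]
```

(This reuses `shift_tokens_right_sketch` from the snippet above.)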
Top GitHub Comments
- Not stale, still looking forward to a response!
- @ArthurZucker, when you have bandwidth, would you like to take a look at this?