
Misalignment between documentation and implementation of mBART50 tokenisation for the decoder


System Info

  • transformers version: 4.23.1
  • Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.14
  • Huggingface_hub version: 0.10.1
  • PyTorch version (GPU?): 1.12.1+cu113 (False)
  • Tensorflow version (GPU?): 2.8.2 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@patil-suraj @SaulLu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

The bug is reproduced in the outputs of this Colab notebook. Steps to reproduce:

  1. Make a copy of the notebook.
  2. Execute the first 2 cells.
  3. In the source file for MBart (/usr/local/bin/python3.7/dist-packages/transformers/models/mbart/modeling_mbart.py), on line 1352 (above outputs = self.model(...), after the if labels is not None block), add print(f'Decoder Input Ids: {decoder_input_ids}\nLabels: {labels}').
  4. Restart the runtime for the changes in the library to take place.
  5. Run the third cell. The output is:
Decoder Input Ids: tensor([[     2, 250020,  47711,   7844, 127666,      8,  18347,  18147,   1362,
            315,  42071,     36,  31563,   8454,  33796,    451,    346, 125577]])
Labels: tensor([[250020,  47711,   7844, 127666,      8,  18347,  18147,   1362,    315,
          42071,     36,  31563,   8454,  33796,    451,    346, 125577,      2]])
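
For reference, the same shift can be reproduced without editing the library source by calling the model's shift_tokens_right helper on the tokenizer's labels directly. This is only a sketch of what I believe the third cell does (the English-Romanian pair is the one from the documentation example; treat the exact sentences as illustrative):

```python
# Minimal sketch, assuming transformers 4.23.1: mirrors what
# MBartForConditionalGeneration.forward does when only `labels` are passed.
from transformers import MBart50TokenizerFast
from transformers.models.mbart.modeling_mbart import shift_tokens_right

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO"
)

src_text = "UN Chief Says There Is No Military Solution in Syria"
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"

model_inputs = tokenizer(src_text, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids  # [lang_id] tokens [eos]

# The model builds decoder_input_ids from labels via shift_tokens_right,
# which wraps the last non-pad token ([eos] here) to position 0.
decoder_input_ids = shift_tokens_right(labels, tokenizer.pad_token_id)

print(f"Decoder Input Ids: {decoder_input_ids}")  # starts with 2 ([eos]), then 250020 (ro_RO)
print(f"Labels: {labels}")
```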

Expected behavior

I was looking into fine-tuning facebook/mbart-large-50 through this example in the documentation. As per the description, the expected input for the model is of the form [lang_id] tokens [eos] for both the encoder and the decoder.

While the MBart50Tokenizer produces outputs in the expected format, the decoder_input_ids are transformed into an incorrect one: [eos] [lang_id] tokens. Specifically, I believe the output should have been the following (do correct me if I am wrong here):

Decoder Input Ids: tensor([[   250020,  47711,   7844, 127666,      8,  18347,  18147,   1362,
            315,  42071,     36,  31563,   8454,  33796,    451,    346, 125577, 2]])
Labels: tensor([[47711,   7844, 127666,      8,  18347,  18147,   1362,    315,
          42071,     36,  31563,   8454,  33796,    451,    346, 125577,      2,  250020]])

This appears to happen because the shift_tokens_right function has not been adapted for mBART-50. As per the docstring of this function,

wrap the last non pad token (the [LID] token)

however, for mBART-50, the last non-pad token is the [eos].
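
To illustrate, here is a toy example of my own (only 250020 = ro_RO, 2 = [eos] and 1 = pad are real ids; the other ids are arbitrary):

```python
# shift_tokens_right rotates the *last* non-pad token to position 0. That token
# is the [lang_id] under the mBART-25 label layout, but the [eos] under the
# mBART-50 layout, which produces the decoder_input_ids shown above.
import torch
from transformers.models.mbart.modeling_mbart import shift_tokens_right

pad_token_id = 1

mbart25_labels = torch.tensor([[47711, 7844, 2, 250020]])  # tokens [eos] [lang_id]
mbart50_labels = torch.tensor([[250020, 47711, 7844, 2]])  # [lang_id] tokens [eos]

print(shift_tokens_right(mbart25_labels, pad_token_id))
# tensor([[250020,  47711,   7844,      2]])  -> [lang_id] tokens [eos], as intended
print(shift_tokens_right(mbart50_labels, pad_token_id))
# tensor([[     2, 250020,  47711,   7844]])  -> [eos] [lang_id] tokens, the reported behaviour
```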

Additional question: why should the [eos] token predict the [lang_id]? This happens in both mBART and mBART-50. If it should not, should the last token in the labels be -100 instead? If so, there would be a subsequent issue, since the labels matrix from the tokenizer seems to use 1 as the padding token instead of -100. Do let me know if I should open a separate issue for that!
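
For context on the -100 point: my understanding is that positions set to -100 are ignored by the cross-entropy loss inside MBartForConditionalGeneration, so padded label positions are usually masked like this. The snippet below is a sketch of that convention, not something the tokenizer does today.

```python
# Sketch of the usual -100 masking convention for padded labels.
import torch

pad_token_id = 1  # mBART pad id
labels = torch.tensor([[250020, 47711, 7844, 2, 1, 1]])    # padded label row
labels = labels.masked_fill(labels == pad_token_id, -100)  # ignored by the loss
print(labels)  # tensor([[250020, 47711, 7844, 2, -100, -100]])
```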

If this bug seems legitimate, I would be glad to provide a fix! I believe the labels key from MBart50Tokenizer would have to be updated to give the same output as MBartTokenizer.
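
In the meantime, a possible interim workaround (a sketch under my own assumptions, not the proposed library change) is to build decoder_input_ids and labels explicitly from the tokenizer output; passing decoder_input_ids alongside labels makes the model skip its internal shift_tokens_right call, and the [eos] position never has to predict the [lang_id]:

```python
# Hypothetical workaround sketch: explicit teacher forcing. `tokenized_labels`
# stands for the MBart50Tokenizer output, i.e. [lang_id] tokens [eos].
import torch

tokenized_labels = torch.tensor([[250020, 47711, 7844, 125577, 2]])  # [lang_id] tokens [eos]
decoder_input_ids = tokenized_labels[:, :-1]  # [lang_id] tokens
labels = tokenized_labels[:, 1:]              # tokens [eos]

# outputs = model(input_ids=..., attention_mask=...,
#                 decoder_input_ids=decoder_input_ids, labels=labels)
```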

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
devaansh100 commented, Nov 11, 2022

Not stale, still looking forward to a response!

1 reaction
LysandreJik commented, Oct 12, 2022

@ArthurZucker, when you have bandwidth, would you like to take a look at this?

