Possible error in MBart Tokenization script -- target lang code is only present in seq once
Environment info
- `transformers` version: current
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.6.0+cu101 (False)
- Using GPU in script?: No.
- Using distributed or parallel set-up in script?: No.
Who can help
MBart: @sshleifer
Information
Model I am using is MBart.
The problem arises when using:
- [x] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)
To reproduce
Steps to reproduce the behavior:
```python
from transformers import MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained('facebook/mbart-large-en-ro')
example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
batch: dict = tokenizer.prepare_seq2seq_batch(
    example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian
)
```
```
-snip-
'labels': tensor([[ 47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577, 2, 250020]])}
```
The target language code is only present once in the target sequence.
```python
print(tokenizer.lang_code_to_id["ro_RO"])  # 250020
```
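One quick way to double-check this at the token level (not in the original report; it reuses `batch` and `tokenizer` from the snippet above):

```python
# Inspect the label ids as tokens; 'ro_RO' shows up only once, after '</s>'.
print(tokenizer.convert_ids_to_tokens(batch["labels"][0].tolist()))
# [..., '</s>', 'ro_RO']
```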
Expected behavior
```
'labels': tensor([[ 250020, 47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577, 2, 250020]])}
```
Here, the target language code appears both first and last, as I believe the MBart paper (https://arxiv.org/pdf/2001.08210.pdf, top of page 3) describes.
MBart Excerpt:
For each instance of a batch we sample a language id symbol <LID> ... Sentences in the instance are separated by the end of sentence (</S>) token. Then, we append the selected <LID> ...
Here is the code I believe is wrong:
```python
def set_tgt_lang_special_tokens(self, lang: str) -> None:
    """Reset the special tokens to the target language setting. Prefix [tgt_lang_code], suffix =[eos]."""
    self.cur_lang_code = self.lang_code_to_id[lang]
    self.prefix_tokens = []
    self.suffix_tokens = [self.eos_token_id, self.cur_lang_code]
```
To me, the comment implies the language code should be first as well. I tested it locally, and merely adding `self.cur_lang_code` to `self.prefix_tokens` resolves the issue.
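Concretely, the local change I tried amounts to this (a sketch against the method quoted above, not the shipped implementation):

```python
def set_tgt_lang_special_tokens(self, lang: str) -> None:
    """Reset the special tokens to the target language setting. Prefix [tgt_lang_code], suffix = [eos, tgt_lang_code]."""
    self.cur_lang_code = self.lang_code_to_id[lang]
    self.prefix_tokens = [self.cur_lang_code]                    # prepend the language code ...
    self.suffix_tokens = [self.eos_token_id, self.cur_lang_code]  # ... and keep it after eos as before
```

With this, the labels come out as `[250020, ..., 2, 250020]`, matching the expected behavior above.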
I do not know if I am misunderstanding the purpose of this script or misusing it. My code above is copied from the `MBartTokenizer` example at https://huggingface.co/transformers/master/model_doc/mbart.html#overview.
If I didn't make a mistake, I'd be more than happy to open a PR to change that one line and fix it.
Comments
You are missing the distinction between `decoder_input_ids` and `labels`, I think. For `mbart-large-en-ro` we have `decoder_start_token_id=250020` for this reason. Then in `finetune.py`, `shift_tokens_right` moves the language code to the 0th column of `decoder_input_ids`. You can also read this, which is related.
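For illustration, the shifting works roughly like this (a sketch of the idea as I understand it, not necessarily the exact code in the repo):

```python
import torch

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Wrap the last non-pad token (for MBart labels, the language code) around to
    position 0 and shift everything else one position to the right."""
    prev_output_tokens = input_ids.clone()
    index_of_last_token = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_last_token).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens

# Shortened labels from the example above: [..., eos, ro_RO]
labels = torch.tensor([[47711, 7844, 125577, 2, 250020]])
print(shift_tokens_right(labels, pad_token_id=1))
# -> [[250020, 47711, 7844, 125577, 2]]  (language code now in column 0)
```

So the model is trained with `decoder_input_ids` starting in `ro_RO`, while `labels` keep it at the end.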
I would definitely welcome a contribution to the docs that explained this clearly!
It's also worth noting that if there is no dedicated BOS token (as with MBart), there is no natural way to tell the decoder to start generating at inference time: the model has never predicted the first token of a sequence.
The example at https://huggingface.co/transformers/master/model_doc/mbart.html#overview prepends the language code during inference, but if the same is not done during training, this causes domain shift.
Unless the decoder (or something else) is editing the targets behind the scenes (beyond shifting the indexes by one during training), I believe the current method of preparing batches introduces domain shift.
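For reference, the inference side I am referring to seeds the decoder with the target language code via `decoder_start_token_id`, roughly like this (a sketch based on the linked overview page; the snippet there may differ in detail):

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")

batch = tokenizer.prepare_seq2seq_batch(
    ["UN Chief Says There Is No Military Solution in Syria"], src_lang="en_XX"
)
# The decoder is started from the target language code, i.e. the same token that
# shift_tokens_right puts in column 0 of decoder_input_ids during training.
translated_tokens = model.generate(
    **batch, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"]
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True))
```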