
Possible error in MBart Tokenization script -- target lang code is only present in seq once


Environment info

  • transformers version: current
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.6.0+cu101 (False)
  • Using GPU in script?: No.
  • Using distributed or parallel set-up in script?: No.

Who can help

MBart: @sshleifer

Information

Model I am using is MBart.

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

To reproduce

Steps to reproduce the behavior:

from transformers import MBartTokenizer
tokenizer = MBartTokenizer.from_pretrained('facebook/mbart-large-en-ro')
example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
batch: dict = tokenizer.prepare_seq2seq_batch(
    example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian
)
print(batch)
-snip-
'labels': tensor([[ 47711,   7844, 127666,      8,  18347,  18147,   1362,    315, 42071,     36,  31563,   8454,  33796,    451,    346, 125577,      2, 250020]])}

The target language code is only present once in the target sequence.

print(tokenizer.lang_code_to_id["ro_RO"])  # 250020
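To make the placement easier to see, the label ids can also be decoded back to tokens (continuing from the batch above; output abbreviated):

print(tokenizer.convert_ids_to_tokens(batch["labels"][0].tolist()))
# [..., '</s>', 'ro_RO']  <- the language code appears only once, at the very end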

Expected behavior

'labels': tensor([[ 250020, 47711,   7844, 127666,      8,  18347,  18147,   1362,    315, 42071,     36,  31563,   8454,  33796,    451,    346, 125577, 2, 250020]])}

Here, the target language code appears both first and last, which is what I believe the MBart paper (https://arxiv.org/pdf/2001.08210.pdf, top of page 3) describes.

MBart Excerpt:

For each instance of a batch we sample a language id symbol <LID> ... sentences in the instance are separated by the end of sentence (</S>) token. Then, we append the selected <LID> token.

Here is the code I believe is wrong:

    def set_tgt_lang_special_tokens(self, lang: str) -> None:
        """Reset the special tokens to the target language setting. Prefix [tgt_lang_code], suffix =[eos]."""
        self.cur_lang_code = self.lang_code_to_id[lang]
        self.prefix_tokens = []
        self.suffix_tokens = [self.eos_token_id, self.cur_lang_code]

To me, the comment implies the language code should be first as well.

I tested it locally, and merely adding self.cur_lang_code to self.prefix_tokens resolves the issue.
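Concretely, the change I tested looks like this (the docstring is my paraphrase of the intended behavior, not the library's):

    def set_tgt_lang_special_tokens(self, lang: str) -> None:
        """Target language setting: prefix=[tgt_lang_code], suffix=[eos, tgt_lang_code]."""
        self.cur_lang_code = self.lang_code_to_id[lang]
        self.prefix_tokens = [self.cur_lang_code]  # changed: the language code now also leads the sequence
        self.suffix_tokens = [self.eos_token_id, self.cur_lang_code]

With this, labels come out as [tgt_lang_code, ..., eos, tgt_lang_code], matching the expected output above.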

I do not know if I am misunderstanding the purpose of this script or misusing it. My code above is copied from the “MBartTokenizer” example at https://huggingface.co/transformers/master/model_doc/mbart.html#overview.

If I haven’t made a mistake, I’d be more than happy to open a PR to change that one line and fix it.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
sshleifer commented, Sep 28, 2020

You are missing the distinction between decoder_input_ids and labels I think. For mbart-large-en-ro we have decoder_start_token_id=250020 for this reason.

Then in finetune.py:

decoder_input_ids = shift_tokens_right(tgt_ids, pad_token_id)
outputs = self(src_ids, attention_mask=src_mask, decoder_input_ids=decoder_input_ids, use_cache=False)

shift_tokens_right moves the language code to the 0th column of decoder_input_ids.
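For concreteness, here is a minimal sketch of that wrap-around shift (my own illustration rather than the library code verbatim; token ids abbreviated from the labels above, pad id assumed to be 1):

import torch

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    # Wrap the last non-pad token of each row (for MBart labels, the language code)
    # around to position 0 and shift everything else one position to the right.
    prev_output_tokens = input_ids.clone()
    index_of_last_token = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_last_token).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens

labels = torch.tensor([[47711, 7844, 125577, 2, 250020]])  # ..., </s>, ro_RO
print(shift_tokens_right(labels, pad_token_id=1))
# tensor([[250020,  47711,   7844, 125577,      2]])  -> ro_RO becomes the decoder start token

So even though the labels only carry the language code at the end, the decoder inputs begin with it.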

You can also read this which is related.

I would definitely welcome a contribution to the docs that explained this clearly!

0 reactions
Sun694 commented, Sep 28, 2020

It’s also worth noting that if there is no dedicated BOS token (as with MBart), then at inference time there is no natural way to tell the decoder to start generating: during training, the model is never asked to predict the first token of a sequence.

The example at https://huggingface.co/transformers/master/model_doc/mbart.html#overview prepends the language code during inference, but if that is not done during training as well, this causes domain shift.

Unless the decoder (or something else) is editing the targets behind the scenes (beyond shifting the indices by one position during training), I believe the current method of preparing batches introduces domain shift.
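For reference, the inference-side pattern from the linked overview looks roughly like this (paraphrased; exact class names and defaults may differ between versions):

from transformers import MBartTokenizer, AutoModelForSeq2SeqLM

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-en-ro")

batch = tokenizer.prepare_seq2seq_batch(["UN Chief Says There Is No Military Solution in Syria"], src_lang="en_XX")
# The decoder is explicitly told to start from the target language code,
# which should match whatever it saw as its first input during training.
translated = model.generate(**batch, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
print(tokenizer.batch_decode(translated, skip_special_tokens=True))

The concern above is whether the first decoder input at training time matches this start token.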
