
Issues fine-tuning mBART-50 many-to-many

See original GitHub issue
  • transformers version: Latest
  • Platform:
  • Python version: 3.8
  • PyTorch version: 1.8.0
  • Using GPU in script?: Yes (A100)
  • Using distributed or parallel set-up in script?: No

I am trying to fine-tune mBART-50 many-to-many with the following command:

python ./transformers/examples/seq2seq/run_translation.py \
    --model_name_or_path facebook/mbart-large-50-many-to-many-mmt \
    --do_train \
    --do_eval \
    --source_lang ru_RU \
    --target_lang en_XX \
    --train_file ./corpus_v2/train.json \
    --validation_file ./corpus_v2/valid.json \
    --output_dir /local/nlpswordfish/tuhin/mbart50/tst-translation \
    --per_device_train_batch_size=32 \
    --per_device_eval_batch_size=8 \
    --overwrite_output_dir \
    --predict_with_generate \
    --max_train_samples 51373 \
    --max_val_samples 6424 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 8 \
    --save_strategy epoch \
    --evaluation_strategy epoch

Even though I explicitly pass the source language as ru_RU and the target as en_XX, I get an error; see my log below, where I printed the source and target languages:


 Assigning ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN', 'af_ZA', 'az_AZ', 'bn_IN', 'fa_IR', 'he_IL', 'hr_HR', 'id_ID', 'ka_GE', 'km_KH', 'mk_MK', 'ml_IN', 'mn_MN', 'mr_IN', 'pl_PL', 'ps_AF', 'pt_XX', 'sv_SE', 'sw_KE', 'ta_IN', 'te_IN', 'th_TH', 'tl_XX', 'uk_UA', 'ur_PK', 'xh_ZA', 'gl_ES', 'sl_SI'] to the additional_special_tokens key of the tokenizer
 Src lang is  en_XX
 ids [250004]
 ids [2]
 loading weights file https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt/resolve/main/pytorch_model.bin from cache at /home/tuhin.chakr/.cache/huggingface/transformers/e33fcda1a71396b8475e16e2fe1458cfa62c6013f8cb3787d6aa4364ec5251c6.d802a5ca7720894045dd2c9dcee6069d27aa92fbbe33f52b44d479538dc3ccc3
 All model checkpoint weights were used when initializing MBartForConditionalGeneration.
 
 All the weights of MBartForConditionalGeneration were initialized from the model checkpoint at facebook/mbart-large-50-many-to-many-mmt.
 If your task is similar to the task the model of the checkpoint was trained on, you can already use MBartForConditionalGeneration for predictions without further training.
 Tgt lang is  None
 self.prefix_tokens is [None]
 ids [None]
 Traceback (most recent call last):
   File "./transformers/examples/seq2seq/run_translation.py", line 564, in <module
     main()
   File "./transformers/examples/seq2seq/run_translation.py", line 403, in main
     train_dataset = train_dataset.map(
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1289, in map
     update_data = does_function_return_dict(test_inputs, test_indices)
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1260, in does_function_return_dict
     function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
   File "./transformers/examples/seq2seq/run_translation.py", line 384, in preprocess_function
     with tokenizer.as_target_tokenizer():
   File "/home/tuhin.chakr/yes/lib/python3.8/contextlib.py", line 113, in __enter__
     return next(self.gen)
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/models/mbart/tokenization_mbart50_fast.py", line 242, in as_target_tokenizer
     self.set_tgt_lang_special_tokens(self.tgt_lang)
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/models/mbart/tokenization_mbart50_fast.py", line 269, in set_tgt_lang_special_tokens
     prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 287, in convert_ids_to_tokens
     index = int(index)
 TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
  
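The root cause, judging from the log above, is that the script never sets the target language on the mBART-50 tokenizer: tgt_lang stays None, so as_target_tokenizer() builds prefix_tokens as [None] and convert_ids_to_tokens() fails on it. A minimal workaround sketch, setting both languages on the tokenizer directly instead of relying on the script to forward them (assumes a transformers version with mBART-50 support; the sentence pair is made up):

    # Workaround sketch: set src_lang/tgt_lang explicitly so the prefix tokens
    # are real language-code ids instead of [None].
    from transformers import MBart50TokenizerFast

    tokenizer = MBart50TokenizerFast.from_pretrained(
        "facebook/mbart-large-50-many-to-many-mmt",
        src_lang="ru_RU",  # source: Russian
        tgt_lang="en_XX",  # target: English
    )

    model_inputs = tokenizer("Привет, мир!", return_tensors="pt")
    with tokenizer.as_target_tokenizer():  # tgt_lang is set, so this no longer crashes
        labels = tokenizer("Hello, world!", return_tensors="pt").input_ids

    # Sanity check: en_XX maps to id 250004, matching "ids [250004]" in the log.
    print(tokenizer.convert_tokens_to_ids("en_XX"))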

Also, as far as I understand, many-to-many fine-tuning requires some separate processing based on the paper, and that seems to be missing?

[image: excerpt from the paper]

What should the data format be? Additionally, will you release a many-to-one model as well, although many-to-one is a subset of many-to-many?
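For what it's worth, run_translation.py expects JSON Lines input: one object per line with a "translation" dict keyed by the bare language codes (the script strips the regional suffix from --source_lang/--target_lang, so ru_RU/en_XX become the keys "ru"/"en"). A minimal sketch that writes such a file (the sentence pair is made up):

    # Sketch of the JSON Lines layout run_translation.py reads: each line is one
    # {"translation": {...}} record keyed by bare language codes.
    import json

    pairs = [
        ("Привет, мир!", "Hello, world!"),  # made-up Russian-English pair
    ]
    with open("train.json", "w", encoding="utf-8") as f:
        for ru, en in pairs:
            record = {"translation": {"ru": ru, "en": en}}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")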

@patrickvonplaten, @patil-suraj

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 21 (8 by maintainers)

Top GitHub Comments

2 reactions
patil-suraj commented, Apr 7, 2021

The many-to-one checkpoint is now available on the Hub: https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt
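For reference, a minimal translation sketch with that checkpoint (only src_lang needs to be set, since the target is always English; the input sentence is made up):

    # Translate Russian to English with the many-to-one checkpoint.
    from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

    model = MBartForConditionalGeneration.from_pretrained(
        "facebook/mbart-large-50-many-to-one-mmt"
    )
    tokenizer = MBart50TokenizerFast.from_pretrained(
        "facebook/mbart-large-50-many-to-one-mmt", src_lang="ru_RU"
    )

    inputs = tokenizer("Привет, мир!", return_tensors="pt")
    generated = model.generate(**inputs)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))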

1 reaction
tuhinjubcse commented, Apr 7, 2021

It would be really helpful if you could provide a notebook documenting how to do that, or even a README, just so that it's clear.


