
Issues fine-tuning mBART-50 many-to-many

See original GitHub issue
  • transformers version: Latest
  • Platform:
  • Python version: 3.8
  • PyTorch version: 1.8.0
  • Using GPU in script?: Yes (A100)
  • Using distributed or parallel set-up in script?: No

I am trying to fine-tune mBART-50 many-to-many with the following command:

python ./transformers/examples/seq2seq/run_translation.py \
    --model_name_or_path facebook/mbart-large-50-many-to-many-mmt \
    --do_train \
    --do_eval \
    --source_lang ru_RU \
    --target_lang en_XX \
    --train_file ./corpus_v2/train.json \
    --validation_file ./corpus_v2/valid.json \
    --output_dir /local/nlpswordfish/tuhin/mbart50/tst-translation \
    --per_device_train_batch_size=32 \
    --per_device_eval_batch_size=8 \
    --overwrite_output_dir \
    --predict_with_generate \
    --max_train_samples 51373 \
    --max_val_samples 6424 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 8 \
    --save_strategy epoch \
    --evaluation_strategy epoch

Even though I explicitly pass the source language as ru_RU and the target as en_XX, I get an error; see my log below, where I printed the source and target languages:


 Assigning ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN', 'af_ZA', 'az_AZ', 'bn_IN', 'fa_IR', 'he_IL', 'hr_HR', 'id_ID', 'ka_GE', 'km_KH', 'mk_MK', 'ml_IN', 'mn_MN', 'mr_IN', 'pl_PL', 'ps_AF', 'pt_XX', 'sv_SE', 'sw_KE', 'ta_IN', 'te_IN', 'th_TH', 'tl_XX', 'uk_UA', 'ur_PK', 'xh_ZA', 'gl_ES', 'sl_SI'] to the additional_special_tokens key of the tokenizer
 Src lang is  en_XX
 ids [250004]
 ids [2]
 loading weights file https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt/resolve/main/pytorch_model.bin from cache at /home/tuhin.chakr/.cache/huggingface/transformers/e33fcda1a71396b8475e16e2fe1458cfa62c6013f8cb3787d6aa4364ec5251c6.d802a5ca7720894045dd2c9dcee6069d27aa92fbbe33f52b44d479538dc3ccc3
 All model checkpoint weights were used when initializing MBartForConditionalGeneration.
 
 All the weights of MBartForConditionalGeneration were initialized from the model checkpoint at facebook/mbart-large-50-many-to-many-mmt.
 If your task is similar to the task the model of the checkpoint was trained on, you can already use MBartForConditionalGeneration for predictions without further training.
 Tgt lang is  None
 self.prefix_tokens is [None]
 ids [None]
 Traceback (most recent call last):
   File "./transformers/examples/seq2seq/run_translation.py", line 564, in <module
     main()
   File "./transformers/examples/seq2seq/run_translation.py", line 403, in main
     train_dataset = train_dataset.map(
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1289, in map
     update_data = does_function_return_dict(test_inputs, test_indices)
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1260, in does_function_return_dict
     function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
   File "./transformers/examples/seq2seq/run_translation.py", line 384, in preprocess_function
     with tokenizer.as_target_tokenizer():
   File "/home/tuhin.chakr/yes/lib/python3.8/contextlib.py", line 113, in __enter__
     return next(self.gen)
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/models/mbart/tokenization_mbart50_fast.py", line 242, in as_target_tokenizer
     self.set_tgt_lang_special_tokens(self.tgt_lang)
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/models/mbart/tokenization_mbart50_fast.py", line 269, in set_tgt_lang_special_tokens
     prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
   File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 287, in convert_ids_to_tokens
     index = int(index)
 TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
  
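The root cause, judging from the log above, is that the script never sets the target language on the mBART-50 tokenizer: tgt_lang stays None, so as_target_tokenizer() builds prefix_tokens as [None] and convert_ids_to_tokens() fails on it. A minimal workaround sketch, setting both languages on the tokenizer directly instead of relying on the script to forward them (assumes a transformers version with mBART-50 support; the sentence pair is made up):

    # Workaround sketch: set src_lang/tgt_lang explicitly so the prefix tokens
    # are real language-code ids instead of [None].
    from transformers import MBart50TokenizerFast

    tokenizer = MBart50TokenizerFast.from_pretrained(
        "facebook/mbart-large-50-many-to-many-mmt",
        src_lang="ru_RU",  # source: Russian
        tgt_lang="en_XX",  # target: English
    )

    model_inputs = tokenizer("Привет, мир!", return_tensors="pt")
    with tokenizer.as_target_tokenizer():  # tgt_lang is set, so this no longer crashes
        labels = tokenizer("Hello, world!", return_tensors="pt").input_ids

    # Sanity check: en_XX maps to id 250004, matching "ids [250004]" in the log.
    print(tokenizer.convert_tokens_to_ids("en_XX"))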

Also, as far as I understand, many-to-many fine-tuning requires some separate processing based on the paper, and that seems to be missing?

[image: excerpt from the paper]

What should the data format be? Additionally, will you release a many-to-one model as well, although many-to-one is a subset of many-to-many?
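For what it's worth, run_translation.py expects JSON Lines input: one object per line with a "translation" dict keyed by the bare language codes (the script strips the regional suffix from --source_lang/--target_lang, so ru_RU/en_XX become the keys "ru"/"en"). A minimal sketch that writes such a file (the sentence pair is made up):

    # Sketch of the JSON Lines layout run_translation.py reads: each line is one
    # {"translation": {...}} record keyed by bare language codes.
    import json

    pairs = [
        ("Привет, мир!", "Hello, world!"),  # made-up Russian-English pair
    ]
    with open("train.json", "w", encoding="utf-8") as f:
        for ru, en in pairs:
            record = {"translation": {"ru": ru, "en": en}}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")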

@patrickvonplaten, @patil-suraj

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 21 (8 by maintainers)

Top GitHub Comments

2 reactions
patil-suraj commented, Apr 7, 2021

The many-to-one checkpoint is now available on the Hub: https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt
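For reference, a minimal translation sketch with that checkpoint (only src_lang needs to be set, since the target is always English; the input sentence is made up):

    # Translate Russian to English with the many-to-one checkpoint.
    from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

    model = MBartForConditionalGeneration.from_pretrained(
        "facebook/mbart-large-50-many-to-one-mmt"
    )
    tokenizer = MBart50TokenizerFast.from_pretrained(
        "facebook/mbart-large-50-many-to-one-mmt", src_lang="ru_RU"
    )

    inputs = tokenizer("Привет, мир!", return_tensors="pt")
    generated = model.generate(**inputs)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))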

1 reaction
tuhinjubcse commented, Apr 7, 2021

It would be really helpful if you could provide a notebook documenting how to do that, or even a README, just so that it's clear.


