Issues fine-tuning mBART-50 many-to-many
See original GitHub issue

- transformers version: latest
- Platform:
- Python version: 3.8
- PyTorch version: 1.8.0
- Using GPU in script?: Yes (A100)
- Using distributed or parallel set-up in script?: No
I am trying to fine-tune mBART-50 many-to-many:
python ./transformers/examples/seq2seq/run_translation.py \
--model_name_or_path facebook/mbart-large-50-many-to-many-mmt \
--do_train \
--do_eval \
--source_lang ru_RU \
--target_lang en_XX \
--train_file ./corpus_v2/train.json \
--validation_file ./corpus_v2/valid.json \
--output_dir /local/nlpswordfish/tuhin/mbart50/tst-translation \
--per_device_train_batch_size=32 \
--per_device_eval_batch_size=8 \
--overwrite_output_dir \
--predict_with_generate \
--max_train_samples 51373 \
--max_val_samples 6424 \
--gradient_accumulation_steps 1 \
--num_train_epochs 8 \
--save_strategy epoch \
--evaluation_strategy epoch
Even though I explicitly pass the source language as ru_RU and the target as en_XX, I get an error (see my log below). I tried printing the source and target languages:
Assigning ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN', 'af_ZA', 'az_AZ', 'bn_IN', 'fa_IR', 'he_IL', 'hr_HR', 'id_ID', 'ka_GE', 'km_KH', 'mk_MK', 'ml_IN', 'mn_MN', 'mr_IN', 'pl_PL', 'ps_AF', 'pt_XX', 'sv_SE', 'sw_KE', 'ta_IN', 'te_IN', 'th_TH', 'tl_XX', 'uk_UA', 'ur_PK', 'xh_ZA', 'gl_ES', 'sl_SI'] to the additional_special_tokens key of the tokenizer
Src lang is en_XX
ids [250004]
ids [2]
loading weights file https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt/resolve/main/pytorch_model.bin from cache at /home/tuhin.chakr/.cache/huggingface/transformers/e33fcda1a71396b8475e16e2fe1458cfa62c6013f8cb3787d6aa4364ec5251c6.d802a5ca7720894045dd2c9dcee6069d27aa92fbbe33f52b44d479538dc3ccc3
All model checkpoint weights were used when initializing MBartForConditionalGeneration.
All the weights of MBartForConditionalGeneration were initialized from the model checkpoint at facebook/mbart-large-50-many-to-many-mmt.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MBartForConditionalGeneration for predictions without further training.
Tgt lang is None
self.prefix_tokens is [None]
ids [None]
Traceback (most recent call last):
File "./transformers/examples/seq2seq/run_translation.py", line 564, in <module
main()
File "./transformers/examples/seq2seq/run_translation.py", line 403, in main
train_dataset = train_dataset.map(
File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1289, in map
update_data = does_function_return_dict(test_inputs, test_indices)
File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1260, in does_function_return_dict
function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "./transformers/examples/seq2seq/run_translation.py", line 384, in preprocess_function
with tokenizer.as_target_tokenizer():
File "/home/tuhin.chakr/yes/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/models/mbart/tokenization_mbart50_fast.py", line 242, in as_target_tokenizer
self.set_tgt_lang_special_tokens(self.tgt_lang)
File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/models/mbart/tokenization_mbart50_fast.py", line 269, in set_tgt_lang_special_tokens
prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
File "/home/tuhin.chakr/yes/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 287, in convert_ids_to_tokens
index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
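For comparison, the traceback points at tgt_lang being None when as_target_tokenizer() switches the special tokens. Here is a minimal sketch of how I understand the tokenizer is supposed to be set up outside the example script (assuming MBart50TokenizerFast, the class shown in the traceback):

from transformers import MBart50TokenizerFast

# Passing src_lang/tgt_lang at load time should avoid the None prefix token from the traceback.
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="ru_RU",
    tgt_lang="en_XX",
)

inputs = tokenizer("Пример предложения.", return_tensors="pt")

# Target-side tokenization: this is where set_tgt_lang_special_tokens(None)
# fails if tgt_lang was never propagated to the tokenizer.
with tokenizer.as_target_tokenizer():
    labels = tokenizer("An example sentence.", return_tensors="pt")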
Also, as far as I understand, fine-tuning the many-to-many model requires some separate processing based on the paper, which seems to be missing? What should the data format be? Additionally, will you release a many-to-one model as well, even though many-to-one is a subset of many-to-many?
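To frame the data format question, my current understanding (an assumption based on how the run_translation.py example indexes a "translation" dict and splits language codes like ru_RU on the underscore) is that the train/validation files are JSON Lines, one sentence pair per line:

{"translation": {"ru": "Пример предложения.", "en": "An example sentence."}}
{"translation": {"ru": "Ещё одно предложение.", "en": "Another sentence."}}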
Issue Analytics
- Created 3 years ago
- Comments: 21 (8 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The many-to-one checkpoint is now available on the Hub: https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt
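A quick usage sketch for that checkpoint, following the usual mBART-50 pattern (with many-to-one the target is always English, so only the source language needs to be set):

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")

# Only the source language is set; the model always translates into English.
tokenizer.src_lang = "ru_RU"
encoded = tokenizer("Пример предложения.", return_tensors="pt")
generated = model.generate(**encoded)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))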
It would be really helpful if you could provide a notebook documenting how to do that, or even a README, just so that it's clear.
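Until such a notebook exists, the model card's pattern for running a specific direction with the many-to-many checkpoint looks roughly like this (the target language id is forced as the first generated token):

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

tokenizer.src_lang = "ru_RU"
encoded = tokenizer("Пример предложения.", return_tensors="pt")

# For many-to-many, force the target language token as the first generated token,
# otherwise the model may translate into the wrong language.
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))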