Run_summarization not working for mbart50

transformers version: 4.5.0
Platform: Linux
Python version: 1.7.1
PyTorch version (GPU?):
Tensorflow version (GPU?):
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help

@patil-suraj @LysandreJik Models: mbart

I am running the run_summarization.py script with the command below:

python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path facebook/mbart-large-50 \
    --do_train --do_eval --do_predict \
    --test_file /home/aniruddha/mbart/mbart_json/bendev_mbart.json \
    --train_file /home/aniruddha/mbart/mbart_json/bentrain_mbart.json \
    --validation_file /home/aniruddha/mbart/mbart_json/bendev_mbart.json \
    --text_column text \
    --summary_column summary \
    --output_dir mbart50_bengali-summarization \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=2 \
    --overwrite_output_dir true \
    --source_prefix "summarize: " \
    --predict_with_generate yes

My dataset is in the JSON format below (I am working with the Bengali language):

{"text": "I’m sitting here in a boring room. It’s just another rainy Sunday afternoon. I’m wasting my time I got nothing to do. I’m hanging around I’m waiting for you. But nothing ever happens. And I wonder", "summary": "I’m sitting in a room where I’m waiting for something to happen"}

Error:

File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 295, in convert_ids_to_tokens
    index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
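
As a sanity check on the record format described above, here is a minimal sketch that verifies each record parses as JSON and carries non-empty "text" and "summary" fields. It assumes the files hold one JSON object per line, which is not confirmed by the issue. check_records is a hypothetical helper, not part of transformers or datasets, and it does not address the traceback itself.

```python
import json


def check_records(lines, text_column="text", summary_column="summary"):
    """Return (line_number, problem) pairs for records that look malformed.

    Hypothetical helper for illustration; assumes one JSON object per line.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for column in (text_column, summary_column):
            value = record.get(column)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty {column!r}"))
    return problems


# Made-up sample records: one valid, one with typographic quotes, one incomplete.
sample = [
    '{"text": "It is just another rainy Sunday afternoon.", "summary": "A rainy Sunday."}',
    '{“text”: “typographic quotes”, “summary”: “not valid JSON”}',
    '{"text": "no summary here"}',
]
for number, problem in check_records(sample):
    print(number, problem)
```

To check a real file, pass an open file handle instead of the sample list, for example check_records(open(path, encoding="utf-8")).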

Issue Analytics

Created 2 years ago
Comments: 6 (1 by maintainers)

Top GitHub Comments

patil-suraj commented, Apr 30, 2021

Hi @Aniruddha-JU

Right now, run_summarization.py does not support fine-tuning mBART for summarization, because the proper language tokens need to be set for mBART50. For now, you could easily modify the script to adapt it for mBART50 by setting the correct language tokens, as is done in the translation example.

https://github.com/huggingface/transformers/blob/master/examples/pytorch/translation/run_translation.py#L340-L380

The difference here would be that the source and target languages will be the same.

Also, could you please post the full stack trace? The error seems unrelated to mBART.
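
A minimal sketch of the suggestion above, assuming the fix amounts to setting the tokenizer's src_lang and tgt_lang attributes to the same mBART-50 language code (bn_IN for Bengali) before any tokenization happens. set_mbart50_languages is a hypothetical helper, not part of transformers. The stub object and its ids are placeholders so the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def set_mbart50_languages(tokenizer, lang_code):
    """Use the same language code as source and target (summarization case).

    Hypothetical helper for illustration, not part of transformers.
    """
    # Validate against the tokenizer's code table when it exposes one.
    known = getattr(tokenizer, "lang_code_to_id", None)
    if known is not None and lang_code not in known:
        raise ValueError(f"Unknown mBART-50 language code: {lang_code!r}")
    tokenizer.src_lang = lang_code
    tokenizer.tgt_lang = lang_code
    return tokenizer


# Stand-in for a real tokenizer; the ids below are placeholders, not real values.
# With transformers installed, the real object would be loaded with:
#   from transformers import MBart50TokenizerFast
#   tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
stub = SimpleNamespace(
    src_lang="en_XX",
    tgt_lang=None,  # an unset target language
    lang_code_to_id={"en_XX": 0, "bn_IN": 1},
)
set_mbart50_languages(stub, "bn_IN")
print(stub.src_lang, stub.tgt_lang)
```

In the script, the call would go right after the tokenizer is created and before the preprocessing function runs, so that both the inputs and the summaries get the language token.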

rahul765 commented, Dec 7, 2021

    with self.tokenizer.as_target_tokenizer():
File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/models/mbart50/tokenization_mbart50_fast.py", line 215, in as_target_tokenizer
    self.set_tgt_lang_special_tokens(self.tgt_lang)
File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/models/mbart50/tokenization_mbart50_fast.py", line 240, in set_tgt_lang_special_tokens
    prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 307, in convert_ids_to_tokens
    index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
