
Run_summarization not working for mbart50

See original GitHub issue
  • transformers version: 4.5.0
  • Platform: Linux
  • Python version: 3.8
  • PyTorch version (GPU?): 1.7.1
  • Tensorflow version (GPU?):
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help

@patil-suraj @LysandreJik

Models: mbart

I am running the run_summarization.py script with the command below:

python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path facebook/mbart-large-50 \
    --do_train --do_eval --do_predict \
    --test_file /home/aniruddha/mbart/mbart_json/bendev_mbart.json \
    --train_file /home/aniruddha/mbart/mbart_json/bentrain_mbart.json \
    --validation_file /home/aniruddha/mbart/mbart_json/bendev_mbart.json \
    --text_column text \
    --summary_column summary \
    --output_dir mbart50_bengali-summarization \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=2 \
    --overwrite_output_dir true \
    --source_prefix "summarize: " \
    --predict_with_generate yes
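For reference, the --train_file / --validation_file / --test_file arguments above point at plain JSON files that the script reads with the datasets JSON loader, which generally expects one JSON object per line, with the column names passed as --text_column and --summary_column. A minimal sketch of writing such a file (the file name and example rows are placeholders, not taken from the issue):

```python
import json

# Placeholder rows; in the issue these would be Bengali article/summary pairs.
rows = [
    {"text": "long article text goes here ...", "summary": "short summary goes here"},
    {"text": "another long article ...", "summary": "another short summary"},
]

# One JSON object per line, so the datasets JSON loader can read it as a table.
with open("bentrain_mbart.json", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```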

My dataset is in the JSON format below (I am doing it for the Bengali language):

{"text": "I'm sitting here in a boring room. It's just another rainy Sunday afternoon. I'm wasting my time I got nothing to do. I'm hanging around I'm waiting for you. But nothing ever happens. And I wonder", "summary": "I'm sitting in a room where I'm waiting for something to happen"}

Error:

File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 295, in convert_ids_to_tokens
  index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
patil-suraj commented, Apr 30, 2021

Hi @Aniruddha-JU

Right now run_summarization.py does not support fine-tuning mBART for summarization, because we need to set the proper language tokens for mBART-50. For now, you could easily modify the script to adapt it for mBART-50 by setting the correct language tokens, as is done in the translation example:

https://github.com/huggingface/transformers/blob/master/examples/pytorch/translation/run_translation.py#L340-L380

The difference here is that the source and target language would be the same.
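As a rough illustration of that adaptation (my own sketch, not an official patch; "bn_IN" is assumed here as the Bengali code in the mBART-50 language list):

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50"
lang_code = "bn_IN"  # summarization: source and target language are the same

# Set the language codes on the tokenizer, as run_translation.py does for mBART.
tokenizer = MBart50TokenizerFast.from_pretrained(
    model_name, src_lang=lang_code, tgt_lang=lang_code
)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# mBART-50 generation starts from the target-language token, so force the BOS
# token to that language code (hedged: this mirrors the mBART-50 model cards).
model.config.forced_bos_token_id = tokenizer.lang_code_to_id[lang_code]

# Encode inputs and labels the way the example scripts do.
inputs = tokenizer("a long Bengali article ...", return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer("a short Bengali summary ...", return_tensors="pt").input_ids
```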

Also, could you please post the full stack trace? The error seems unrelated to mBART.

0 reactions
rahul765 commented, Dec 7, 2021

with self.tokenizer.as_target_tokenizer():
File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/contextlib.py", line 112, in __enter__
  return next(self.gen)
File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/models/mbart50/tokenization_mbart50_fast.py", line 215, in as_target_tokenizer
  self.set_tgt_lang_special_tokens(self.tgt_lang)
File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/models/mbart50/tokenization_mbart50_fast.py", line 240, in set_tgt_lang_special_tokens
  prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 307, in convert_ids_to_tokens
  index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
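My reading of the trace above (an inference, not a confirmed diagnosis) is that the mBART-50 tokenizer was created without a tgt_lang, so as_target_tokenizer() tries to convert a None language id, which is the int(None) TypeError. A small sketch of reproducing and avoiding it ("bn_IN" used only as an example code):

```python
from transformers import MBart50TokenizerFast

# tgt_lang not set: as_target_tokenizer() hits the TypeError shown in the trace above.
tok = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
try:
    with tok.as_target_tokenizer():
        pass
except TypeError as err:
    print("reproduced:", err)

# Passing the language codes up front avoids it.
tok = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="bn_IN", tgt_lang="bn_IN"
)
with tok.as_target_tokenizer():
    labels = tok("some target text", return_tensors="pt").input_ids
```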

Read more comments on GitHub >

Top Results From Across the Web

MBart and MBart-50 - Hugging Face
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the...

finetune-mBART50-en-vi - Kaggle
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended...

mBART50: Multilingual Fine-Tuning of Extensible Multilingual ...
I would like to see with more experiments if mBART solves that problem, and if yes, why. mBART50: Extending a Pretrained Model. In...

How to reduce the execution time for translation using mBART ...
By limiting the no of words to be translated. model_inputs=tokenizer(text,return_tensors="pt", max_length=500, truncation=True).

Multilingual Language Translation using Facebook's mBART ...
HuggingFace recently integrated Facebook AI's mBART-50 models which can be used to Translate text to, or between 50 languages.
