Run_summarization not working for mbart50
- transformers version: 4.5.0
- Platform: linux
- Python version: 1.7.1
- PyTorch version (GPU?):
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help
@patil-suraj @LysandreJik Models: mbart
I am running the run_summarization.py script with the command below:

python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path facebook/mbart-large-50 \
    --do_train --do_eval --do_predict \
    --test_file /home/aniruddha/mbart/mbart_json/bendev_mbart.json \
    --train_file /home/aniruddha/mbart/mbart_json/bentrain_mbart.json \
    --validation_file /home/aniruddha/mbart/mbart_json/bendev_mbart.json \
    --text_column text \
    --summary_column summary \
    --output_dir mbart50_bengali-summarization \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=2 \
    --overwrite_output_dir true \
    --source_prefix "summarize: " \
    --predict_with_generate yes
My dataset is in JSON, in the format below (the real data is in Bengali):

{"text": "I’m sitting here in a boring room. It’s just another rainy Sunday afternoon. I’m wasting my time I got nothing to do. I’m hanging around I’m waiting for you. But nothing ever happens. And I wonder", "summary": "I’m sitting in a room where I’m waiting for something to happen"}

Error:
File "/home/aniruddha/anaconda3/envs/mbart/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 295, in convert_ids_to_tokens
    index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
Top GitHub Comments
Hi @Aniruddha-JU
Right now run_summarization.py does not support fine-tuning mBART for summarization, because the proper language tokens need to be set for mBART50. For now, you could easily modify the script to adapt it for mBART50 by setting the correct language tokens, as is done in the translation example: https://github.com/huggingface/transformers/blob/master/examples/pytorch/translation/run_translation.py#L340-L380
The difference here would be that the source and target languages are the same.
Also, could you please post the full stack trace? The error seems unrelated to mBART.
    with self.tokenizer.as_target_tokenizer():
File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/models/mbart50/tokenization_mbart50_fast.py", line 215, in as_target_tokenizer
    self.set_tgt_lang_special_tokens(self.tgt_lang)
File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/models/mbart50/tokenization_mbart50_fast.py", line 240, in set_tgt_lang_special_tokens
    prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
File "/home/rahulpal/anaconda3/envs/rebel/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 307, in convert_ids_to_tokens
    index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
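
This longer trace looks consistent with tgt_lang never being set: set_tgt_lang_special_tokens(self.tgt_lang) ends up with a None id among the prefix tokens, and int(index) then rejects it. Below is a minimal sketch of the adaptation suggested in the maintainer comment, not a tested patch. It assumes the tokenizer and model are already created as in run_summarization.py and that bn_IN is the mBART-50 code for Bengali. The helper name set_mbart50_language is hypothetical and not part of transformers. It relies only on the tokenizer's src_lang / tgt_lang attributes, convert_tokens_to_ids and unk_token_id, and on model.config.forced_bos_token_id, which is roughly what the translation example sets for multilingual tokenizers.

```python
def set_mbart50_language(tokenizer, model, lang_code="bn_IN"):
    """Use one mBART-50 language code as both source and target language.

    Hypothetical helper (not part of transformers). "bn_IN" is assumed to be
    the mBART-50 code for Bengali.
    """
    # Look up the id of the language token; an unknown code is expected to
    # come back as None or as the <unk> id.
    lang_id = tokenizer.convert_tokens_to_ids(lang_code)
    if lang_id is None or lang_id == tokenizer.unk_token_id:
        raise ValueError(f"{lang_code!r} is not a language token of this tokenizer")

    # Summarization is monolingual, so source and target share the same code.
    tokenizer.src_lang = lang_code
    tokenizer.tgt_lang = lang_code

    # Ask generation to start the summary with the target-language token.
    model.config.forced_bos_token_id = lang_id
    return lang_id


# Intended use inside the script, right after tokenizer and model are created
# (left as a comment because it needs the downloaded checkpoint):
#   set_mbart50_language(tokenizer, model, "bn_IN")
```

With tgt_lang set before the preprocessing function enters as_target_tokenizer(), the prefix tokens should no longer contain None. Whether the rest of the script then trains correctly on mBART50 is not something this sketch verifies.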