Training script "run_mlm.py" doesn't work for certain datasets
See original GitHub issue.
Environment info
- transformers version: 3.4.0
- Platform:
- Python version: 3.8.3
- PyTorch version (GPU?): 3.6.0
- Tensorflow version (GPU?):
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Who can help
Information
Model I am using (Bert, XLNet …):
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below) MLM
To reproduce
Steps to reproduce the behavior: run the training script examples/run_mlm.py on the wikipedia dataset:
python run_mlm.py \
--model_name_or_path roberta-base \
--dataset_name wikipedia \
--dataset_config_name 20200501.en \
--do_train \
--output_dir /tmp/test-mlm
Error message
Traceback (most recent call last):
File "run_mlm2.py", line 388, in <module>
main()
File "run_mlm2.py", line 333, in main
tokenized_datasets = tokenized_datasets.map(
File "/home/zeyuy/miniconda3/lib/python3.8/site-packages/datasets/dataset_dict.py", line 283, in map
{
File "/home/zeyuy/miniconda3/lib/python3.8/site-packages/datasets/dataset_dict.py", line 284, in <dictcomp>
k: dataset.map(
File "/home/zeyuy/miniconda3/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1236, in map
update_data = does_function_return_dict(test_inputs, test_indices)
File "/home/zeyuy/miniconda3/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1207, in does_function_return_dict function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "run_mlm2.py", line 315, in group_texts
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
File "run_mlm2.py", line 315, in <dictcomp>
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
TypeError: can only concatenate list (not "str") to list
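For context, the group_texts helper the traceback points at looks roughly like this in examples/run_mlm.py (paraphrased from the 3.4.0-era script, so details may differ by version; max_seq_length comes from the enclosing scope):

def group_texts(examples):
    # Concatenate the lists in each column of the batch. This is the line that
    # fails when a column (e.g. wikipedia's "title") still holds raw strings,
    # because sum(..., []) can only concatenate lists.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the remainder so every chunk is exactly max_seq_length long.
    total_length = (total_length // max_seq_length) * max_seq_length
    # Split the concatenated sequences into max_seq_length-sized chunks.
    result = {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result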
Expected behavior
The training script should work for all datasets in Hugging Face datasets. The problem is that columns other than 'text' (the column we actually train on) interfere when group_texts tries to concatenate the tokenized outputs ('input_ids', 'attention_mask', …) across examples: for wikipedia, the 'title' column still holds raw strings after tokenization, so sum(examples[k], []) raises the TypeError above.
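A minimal illustration of the failure mode, using a hypothetical two-row wikipedia batch (the values are invented for the example):

batch = {
    "input_ids": [[0, 31414, 2], [0, 8901, 2]],  # lists of token ids: fine to sum
    "title": ["Anarchism", "Autism"],            # raw strings left over from the dataset
}
# sum() starts from [] and evaluates [] + "Anarchism", i.e. list + str:
concatenated = {k: sum(batch[k], []) for k in batch}  # TypeError on "title"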
A quick fix: on line 295 of the script, change
remove_columns=[text_column_name],
to
remove_columns=column_names,
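That is, drop every original column after tokenization so group_texts only ever sees the tokenizer's list-valued outputs. A sketch of the fixed call (the surrounding keyword arguments are paraphrased from the script and may differ by version):

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    # column_names holds all original columns, so "title" is removed along with "text".
    remove_columns=column_names,
)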
Should I open a PR, or does someone want to push a quick fix?
Hi, newbie questions are better suited for the forum at https://discuss.huggingface.co; we try to keep the issues for bug reports and feature/model requests.
Again, this is exactly what the CI does, so those failures are linked to your particular environment. Since you didn't tell us what it is, we can't reproduce and fix potential issues. One way to make sure you don't use your GPU if it's busy is to run:
CUDA_VISIBLE_DEVICES='' make tests