
Training script "run_mlm.py" doesn't work for certain datasets

See original GitHub issue

Environment info

  • transformers version: 3.4.0
  • Platform:
  • Python version: 3.8.3
  • PyTorch version (GPU?): 3.6.0
  • Tensorflow version (GPU?):
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet …):

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below) MLM

To reproduce

Steps to reproduce the behavior: basically, run the training script examples/run_mlm.py with the wikipedia dataset:

python run_mlm.py \
--model_name_or_path roberta-base \
--dataset_name wikipedia \
--dataset_config_name 20200501.en \
--do_train \
--output_dir /tmp/test-mlm
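
The wikipedia config above is where the extra column comes in. A quick way to see the raw columns (my own check, not part of the original report; I expect a 'title' column alongside 'text'):

from datasets import load_dataset

# Inspect the columns of the wikipedia config used in the command above.
wiki = load_dataset("wikipedia", "20200501.en", split="train")
print(wiki.column_names)  # expected: ['title', 'text']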

Error Message

Traceback (most recent call last):
  File "run_mlm2.py", line 388, in <module>
    main()
  File "run_mlm2.py", line 333, in main
    tokenized_datasets = tokenized_datasets.map(
  File "/home/zeyuy/miniconda3/lib/python3.8/site-packages/datasets/dataset_dict.py", line 283, in map
    {
  File "/home/zeyuy/miniconda3/lib/python3.8/site-packages/datasets/dataset_dict.py", line 284, in <dictcomp>
    k: dataset.map(
  File "/home/zeyuy/miniconda3/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1236, in map
    update_data = does_function_return_dict(test_inputs, test_indices)
  File "/home/zeyuy/miniconda3/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1207, in does_function_return_dict    function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "run_mlm2.py", line 315, in group_texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
  File "run_mlm2.py", line 315, in <dictcomp>
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
TypeError: can only concatenate list (not "str") to list
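
The failing line is the concatenation inside group_texts. For reference, that helper in this version of run_mlm.py looks roughly like the following (reconstructed from the traceback and from memory of the script; the chunking details may differ slightly):

def group_texts(examples):
    # 'examples' is one batch from datasets.map(batched=True): a dict mapping
    # column names to lists of per-example values.
    # Concatenate every example's values for each column...
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # ...then drop the remainder and re-split into max_seq_length-sized chunks.
    # (max_seq_length comes from the enclosing scope in the script.)
    total_length = (total_length // max_seq_length) * max_seq_length
    result = {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result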

Expected behavior

The training script should work for all datasets in Hugging Face Datasets. The problem is that features other than ‘text’ (the column we train on) interfere when we try to concatenate the tokenized ‘text’ (‘input_ids’, ‘attention_mask’, …) from each instance, as illustrated in the sketch below.
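
A minimal sketch of the clash (the 'title' column name is my assumption of what survives from the wikipedia dataset; the token ids are made up):

# What group_texts receives when a string column survives tokenization:
examples = {
    "input_ids": [[0, 31414, 2], [0, 9064, 2]],  # lists of ints -> concatenate fine
    "title": ["Anarchism", "Autism"],            # strings -> sum(..., []) blows up
}
{k: sum(examples[k], []) for k in examples.keys()}
# TypeError: can only concatenate list (not "str") to list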

A quick fix would be to change line 295 from

remove_columns=[text_column_name],

to

remove_columns=column_names,
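
For context, the surrounding call in run_mlm.py looks roughly like this with the proposed change applied (paraphrased from that version of the script; argument names other than remove_columns are from memory and may not match exactly):

# column_names is datasets["train"].column_names in the script,
# e.g. ['title', 'text'] for wikipedia.
tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,  # proposed fix; was remove_columns=[text_column_name]
    load_from_cache_file=not data_args.overwrite_cache,
)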

Should I open a PR, or does someone want to do a quick fix?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
thomwolf commented, Nov 12, 2020

Hi, newbie questions are better suited for the forum at https://discuss.huggingface.co

We try to keep the issues for bug reports and features/model requests.

1 reaction
sgugger commented, Nov 11, 2020

Again, this is exactly what the CI does, so those failures are linked to your particular environment. Since you didn’t tell us what it is, we can’t reproduce and fix potential issues. One way to make sure you don’t use your GPU if it’s busy is to run CUDA_VISIBLE_DEVICES='' make tests.

Read more comments on GitHub >

Top Results From Across the Web

How to train from scratch with run_mlm.py, .txt file? - Beginners
Hello! Essentially what I want to do is: point the code at a .txt file, and get a trained model out. How can...

Learning rate not set in run_mlm.py? - Stack Overflow
I want to run (or resume) the run_mlm.py script with a specific learning rate, but it doesn't seem like setting it in the...

How to turn your local (zip) data into a Huggingface Dataset
To load any of these datasets in your current python script or jupyter notebook, simply pass the name of the dataset to load_dataset()...

Fine-tune GPT with Line-by-Line Dataset - Finisky Garden
There are three scripts: run_clm.py, run_mlm.py and run_plm.py. ... However, run_clm.py doesn't support line by line dataset.

Pre-Training BERT with Hugging Face Transformers and ...
This will then start our AWS EC2 DL1 instance and run our run_mlm.py script on it using the huggingface/optimum-habana:latest container. from ...
