Training script "run_mlm.py" doesn't work for certain datasets
See original GitHub issue.
Environment info
- transformers version: 3.4.0
- Platform:
- Python version: 3.8.3
- PyTorch version (GPU?): 3.6.0
- Tensorflow version (GPU?):
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Who can help
Information
Model I am using (Bert, XLNet …):
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below) MLM
To reproduce
Steps to reproduce the behavior: run the training script examples/run_mlm.py on the wikipedia dataset:
python run_mlm.py \
--model_name_or_path roberta-base \
--dataset_name wikipedia \
--dataset_config_name 20200501.en \
--do_train \
--output_dir /tmp/test-mlm
Error message
Traceback (most recent call last):
File "run_mlm2.py", line 388, in <module>
main()
File "run_mlm2.py", line 333, in main
tokenized_datasets = tokenized_datasets.map(
File "/home/zeyuy/miniconda3/lib/python3.8/site-packages/datasets/dataset_dict.py", line 283, in map
{
File "/home/zeyuy/miniconda3/lib/python3.8/site-packages/datasets/dataset_dict.py", line 284, in <dictcomp>
k: dataset.map(
File "/home/zeyuy/miniconda3/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1236, in map
update_data = does_function_return_dict(test_inputs, test_indices)
File "/home/zeyuy/miniconda3/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1207, in does_function_return_dict function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "run_mlm2.py", line 315, in group_texts
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
File "run_mlm2.py", line 315, in <dictcomp>
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
TypeError: can only concatenate list (not "str") to list
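For context, the group_texts helper the traceback points at looks roughly like this in examples/run_mlm.py (paraphrased from the 3.4.0-era script, so details may differ by version; max_seq_length comes from the enclosing scope):

def group_texts(examples):
    # Concatenate the lists in each column of the batch. This is the line that
    # fails when a column (e.g. wikipedia's "title") still holds raw strings,
    # because sum(..., []) can only concatenate lists.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the remainder so every chunk is exactly max_seq_length long.
    total_length = (total_length // max_seq_length) * max_seq_length
    # Split the concatenated sequences into max_seq_length-sized chunks.
    result = {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result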
Expected behavior
The training script should work for all datasets in Hugging Face datasets. The problem is that columns other than 'text' (the column we actually train on) interfere when group_texts tries to concatenate the tokenized outputs ('input_ids', 'attention_mask', …) across examples: for wikipedia, the 'title' column still holds raw strings after tokenization, so sum(examples[k], []) raises the TypeError above.
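A minimal illustration of the failure mode, using a hypothetical two-row wikipedia batch (the values are invented for the example):

batch = {
    "input_ids": [[0, 31414, 2], [0, 8901, 2]],  # lists of token ids: fine to sum
    "title": ["Anarchism", "Autism"],            # raw strings left over from the dataset
}
# sum() starts from [] and evaluates [] + "Anarchism", i.e. list + str:
concatenated = {k: sum(batch[k], []) for k in batch}  # TypeError on "title"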
A quick fix: on line 295 of the script, change
remove_columns=[text_column_name],
to
remove_columns=column_names,
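That is, drop every original column after tokenization so group_texts only ever sees the tokenizer's list-valued outputs. A sketch of the fixed call (the surrounding keyword arguments are paraphrased from the script and may differ by version):

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    # column_names holds all original columns, so "title" is removed along with "text".
    remove_columns=column_names,
)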
Should I open a PR, or does someone want to push a quick fix?
Hi, newbie questions are better suited for the forum at https://discuss.huggingface.co; we try to keep the issues for bug reports and feature/model requests.
Again, this is exactly what the CI does, so those failures are linked to your particular environment. Since you didn't tell us what it is, we can't reproduce and fix potential issues. One way to make sure you don't use your GPU if it's busy is to run:
CUDA_VISIBLE_DEVICES='' make tests