[run_clm] handling large inputs
This could probably use a bit of sorting out:
[WARNING|tokenization_utils_base.py:3138] 2021-04-06 21:29:29,790 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1462828 > 1024). Running this sequence through the model will result in indexing errors
This doesn't look right, and it looks quite scary: 1462828 > 1024.
This happens when someone feeds a huge input per entry; e.g. it happened when running:
BS=1; rm -rf output_dir; PYTHONPATH=src USE_TF=0 python examples/language-modeling/run_clm.py \
--model_name_or_path distilgpt2 --do_train --output_dir output_dir --num_train_epochs 1 \
--per_device_train_batch_size 1 --block_size 128 --train_file finetune-gpt2xl/train.csv
The train.csv file comes from https://github.com/Xirider/finetune-gpt2xl, which shows how to train GPT-Neo. The file's only record is a small book: https://raw.githubusercontent.com/Xirider/finetune-gpt2xl/main/train.csv.
So the whole single input is 1462828 tokens.
run_clm.py should either slice the input up, truncate it, or otherwise handle it so that the warning doesn't show up. I'm not sure what the intended design is for handling huge inputs.
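For context, run_clm.py does chunk the tokenized output afterwards: a group_texts step concatenates everything and slices it into block_size pieces, so the over-long sequence never reaches the model as-is; only the tokenization call itself triggers the warning. A rough sketch of that grouping logic (simplified, not a verbatim copy of the script):

from itertools import chain

block_size = 128  # matches --block_size in the command above

def group_texts(examples):
    # Concatenate the tokenized columns, then split them into block_size chunks.
    # Simplified sketch of the grouping step in run_clm.py; the script's actual
    # code may differ in details.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder at the end instead of padding it.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM the labels are a copy of the input ids.
    result["labels"] = result["input_ids"].copy()
    return result

Applied with datasets.map(group_texts, batched=True), the 1462828-token record becomes roughly 11k blocks of 128 tokens, so the model itself never sees anything longer than block_size; the scary part is only the warning.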
Here is the trace to where the warning comes from:
File "examples/language-modeling/run_clm.py", line 444, in <module>
main()
File "examples/language-modeling/run_clm.py", line 322, in main
tokenized_datasets = datasets.map(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/datasets/dataset_dict.py", line 431, in map
{
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/datasets/dataset_dict.py", line 432, in <dictcomp>
k: dataset.map(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1289, in map
update_data = does_function_return_dict(test_inputs, test_indices)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1260, in does_function_return_dict
function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "examples/language-modeling/run_clm.py", line 320, in tokenize_function
return tokenizer(examples[text_column_name])
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/tokenization_utils_base.py", line 2254, in __call__
return self.batch_encode_plus(
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/tokenization_utils_base.py", line 2439, in batch_encode_plus
return self._batch_encode_plus(
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/tokenization_utils.py", line 549, in _batch_encode_plus
batch_outputs = self._batch_prepare_for_model(
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/tokenization_utils.py", line 597, in _batch_prepare_for_model
outputs = self.prepare_for_model(
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/tokenization_utils_base.py", line 2790, in prepare_for_model
self._eventual_warn_about_too_long_sequence(encoded_inputs["input_ids"], max_length, verbose)
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/tokenization_utils_base.py", line 3135, in _eventual_warn_about_too_long_sequence
traceback.print_stack()
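The warning can be reproduced in isolation, outside of run_clm.py, by tokenizing any text longer than the model's model_max_length without asking for truncation; a minimal sketch (the repeated string is just a stand-in for the book-sized record):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # model_max_length is 1024

# Stand-in for the single book-sized record in train.csv.
long_text = "some text " * 200_000

# No truncation is requested, so the full sequence (far longer than 1024 tokens)
# is returned and the "Token indices sequence length is longer than ..." warning fires.
encoded = tokenizer(long_text)
print(len(encoded["input_ids"]), ">", tokenizer.model_max_length)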
Thank you!
Top GitHub Comments
Yeah, it's not ideal to import from testing_utils, but I don't see any import there that is not part of the standard lib, so I don't think it adds any dependency. For the rest, I like your solution, and I don't think it's a problem if the warnings don't come from the same file. It's very clear to me.
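For reference, a rough sketch of the kind of solution discussed above: capturing the tokenizer's logger inside tokenize_function with CaptureLogger from transformers.testing_utils and following the scary message with a friendlier one (tokenizer and text_column_name are assumed to be defined as in run_clm.py; the exact wording and structure here are illustrative):

import transformers
from transformers.testing_utils import CaptureLogger

# Logger that emits the "Token indices sequence length ..." warning.
tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base")

def tokenize_function(examples):
    # Capture anything the tokenizer logs while encoding this batch.
    with CaptureLogger(tok_logger) as cl:
        output = tokenizer(examples[text_column_name])
    # If the over-length warning fired, tell the user it is harmless here:
    # the long sequence is chunked into block_size pieces before training.
    if "Token indices sequence length is longer than the" in cl.out:
        tok_logger.warning(
            "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will "
            "be chunked into smaller bits before being passed to the model."
        )
    return output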
Maybe the warning should be rephrased, but in my opinion it should still be there: in 95% of the cases the user will be feeding the tokenizer outputs directly into the model, and this is really an uncommon use case. So it says something not necessarily warranted in those 5% of cases, but it will help the rest of the time; I think it's a good trade-off.
Catching the warning in the script sounds like a good idea, so as not to scare a beginner.