[run_clm] handling large inputs
This could probably use a bit of sorting out:
[WARNING|tokenization_utils_base.py:3138] 2021-04-06 21:29:29,790 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1462828 > 1024). Running this sequence through the model will result in indexing errors
This doesn't look right, and it looks quite scary: 1462828 > 1024.
This happens when someone feeds a huge input per entry; e.g. it happened when running:
BS=1; rm -rf output_dir; PYTHONPATH=src USE_TF=0 python examples/language-modeling/run_clm.py \
--model_name_or_path distilgpt2 --do_train --output_dir output_dir --num_train_epochs 1 \
--per_device_train_batch_size 1 --block_size 128 --train_file finetune-gpt2xl/train.csv
The train.csv file comes from https://github.com/Xirider/finetune-gpt2xl, which shows how to train GPT-Neo. The file's only record is a small book: https://raw.githubusercontent.com/Xirider/finetune-gpt2xl/main/train.csv.
So the whole single input is 1462828 tokens.
run_clm.py should either slice the input up, truncate it, or otherwise handle it so that the warning doesn't show up. I'm not sure what the intended design is for handling huge inputs.
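For context, run_clm.py does chunk the tokenized output afterwards: a group_texts step concatenates everything and slices it into block_size pieces, so the over-long sequence never reaches the model as-is; only the tokenization call itself triggers the warning. A rough sketch of that grouping logic (simplified, not a verbatim copy of the script):

from itertools import chain

block_size = 128  # matches --block_size in the command above

def group_texts(examples):
    # Concatenate the tokenized columns, then split them into block_size chunks.
    # Simplified sketch of the grouping step in run_clm.py; the script's actual
    # code may differ in details.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder at the end instead of padding it.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM the labels are a copy of the input ids.
    result["labels"] = result["input_ids"].copy()
    return result

Applied with datasets.map(group_texts, batched=True), the 1462828-token record becomes roughly 11k blocks of 128 tokens, so the model itself never sees anything longer than block_size; the scary part is only the warning.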
Here is the trace to where the warning comes from:
File "examples/language-modeling/run_clm.py", line 444, in <module>
main()
File "examples/language-modeling/run_clm.py", line 322, in main
tokenized_datasets = datasets.map(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/datasets/dataset_dict.py", line 431, in map
{
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/datasets/dataset_dict.py", line 432, in <dictcomp>
k: dataset.map(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1289, in map
update_data = does_function_return_dict(test_inputs, test_indices)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1260, in does_function_return_dict
function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "examples/language-modeling/run_clm.py", line 320, in tokenize_function
return tokenizer(examples[text_column_name])
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/tokenization_utils_base.py", line 2254, in __call__
return self.batch_encode_plus(
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/tokenization_utils_base.py", line 2439, in batch_encode_plus
return self._batch_encode_plus(
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/tokenization_utils.py", line 549, in _batch_encode_plus
batch_outputs = self._batch_prepare_for_model(
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/tokenization_utils.py", line 597, in _batch_prepare_for_model
outputs = self.prepare_for_model(
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/tokenization_utils_base.py", line 2790, in prepare_for_model
self._eventual_warn_about_too_long_sequence(encoded_inputs["input_ids"], max_length, verbose)
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/tokenization_utils_base.py", line 3135, in _eventual_warn_about_too_long_sequence
traceback.print_stack()
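The warning can be reproduced in isolation, outside of run_clm.py, by tokenizing any text longer than the model's model_max_length without asking for truncation; a minimal sketch (the repeated string is just a stand-in for the book-sized record):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # model_max_length is 1024

# Stand-in for the single book-sized record in train.csv.
long_text = "some text " * 200_000

# No truncation is requested, so the full sequence (far longer than 1024 tokens)
# is returned and the "Token indices sequence length is longer than ..." warning fires.
encoded = tokenizer(long_text)
print(len(encoded["input_ids"]), ">", tokenizer.model_max_length)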
Thank you!
Top GitHub Comments
Yeah, it's not ideal to import from testing_utils, but I don't see any import there that is not part of the standard lib, so I don't think it adds any dependency. For the rest, I like your solution, and I don't think it's a problem if the warnings don't come from the same file. It's very clear to me.
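For reference, a rough sketch of the kind of solution discussed above: capturing the tokenizer's logger inside tokenize_function with CaptureLogger from transformers.testing_utils and following the scary message with a friendlier one (tokenizer and text_column_name are assumed to be defined as in run_clm.py; the exact wording and structure here are illustrative):

import transformers
from transformers.testing_utils import CaptureLogger

# Logger that emits the "Token indices sequence length ..." warning.
tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base")

def tokenize_function(examples):
    # Capture anything the tokenizer logs while encoding this batch.
    with CaptureLogger(tok_logger) as cl:
        output = tokenizer(examples[text_column_name])
    # If the over-length warning fired, tell the user it is harmless here:
    # the long sequence is chunked into block_size pieces before training.
    if "Token indices sequence length is longer than the" in cl.out:
        tok_logger.warning(
            "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will "
            "be chunked into smaller bits before being passed to the model."
        )
    return output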
Maybe the warning should be rephrased, but in my opinion it should still be there: in 95% of the cases the user will be feeding the tokenizer outputs directly into the model, and this is really an uncommon use case. So it says something not necessarily warranted in those 5% of cases, but it will help the rest of the time; I think it's a good trade-off.
Catching the warning in the script sounds like a good idea, so as not to scare a beginner.