[run_clm] tokenize_function clarification makes it non-hashable => cache not reused
Environment info
- transformers version: master at commit acc851e1ff92835d2a3ee9774d9d0abfda6e3f36 (from yesterday)
- Platform:
- Python version:
- PyTorch version (GPU?):
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help
@stas00 since you opened the PR #11145
Information
Model I am using (Bert, XLNet …):
The problem arises when using:
- [x] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)
To reproduce
I am running the minimal command:
CUDA_VISIBLE_DEVICES=0 python examples/language-modeling/run_clm.py \
--model_name_or_path gpt2 \
--dataset_name ./data/bk --block_size 1024 \
--do_train \
--output_dir debug --overwrite_output_dir \
--preprocessing_num_workers 5
When it gets to line 331, datasets.map gives this warning:
[WARNING|tokenization_utils_base.py:3143] 2021-04-09 15:48:53,408 >> Token indices sequence length is longer than the specified maximum sequence length for this model (191443 > 1024). Running this sequence through the model will result in indexing errors
[WARNING|run_clm.py:333] 2021-04-09 15:48:53,408 >> ^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model.
04/09/2021 15:48:53 - WARNING - 17900 - datasets.fingerprint - Parameter 'function'=<function tokenize_function at 0x7f747662c268> of the transform datasets.arrow_dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Basically, hashing the tokenize_function (to produce the cache file name) fails, so the pre-processed cache is not reused on the next launch.
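For context (my summary, not from the original report): datasets.map derives the cache file name from a fingerprint of the map function, computed by serializing it with dill; if serialization fails, a random fingerprint is used and the cached Arrow file can never be matched on the next run. A minimal, hedged sketch of checking whether a tokenize function hashes deterministically, assuming datasets.fingerprint.Hasher is available in the installed datasets version (the column name "text" is an illustrative placeholder):

# Hedged sketch, not from the issue: probe whether a map function can be hashed
# by the datasets fingerprinting machinery. Hasher is the helper datasets uses
# internally to derive cache fingerprints.
from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_function(examples):
    # same shape as the run_clm.py function, with a fixed column name
    return tokenizer(examples["text"])

try:
    # A stable value here across runs means the preprocessed cache can be reused.
    print(Hasher.hash(tokenize_function))
except Exception as err:
    # Serialization failure => datasets falls back to a random fingerprint.
    print(f"tokenize_function could not be hashed: {err}")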
The tokenize_function was originally:

def tokenize_function(examples):
    output = tokenizer(examples[text_column_name])
    return output
and became:

def tokenize_function(examples):
    tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base")
    with CaptureLogger(tok_logger) as cl:
        output = tokenizer(examples[text_column_name])
    # clm input could be much much longer than block_size
    if "Token indices sequence length is longer than the" in cl.out:
        tok_logger.warning(
            "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model."
        )
    return output
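Not part of the issue, just an illustrative guess at where the hashing might break: the new version references CaptureLogger and a logger object, and if any object reachable from the function fails to serialize with dill, the fingerprint computation fails. A hedged diagnostic sketch, assuming dill is installed (datasets depends on it for fingerprinting), that probes each suspect object in isolation:

# Hedged diagnostic sketch, not part of run_clm.py: test whether the objects
# referenced by the new tokenize_function can be serialized with dill, which is
# what the datasets fingerprinting relies on.
import dill
import transformers
from transformers.testing_utils import CaptureLogger

tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base")

for name, obj in {"tok_logger": tok_logger, "CaptureLogger": CaptureLogger}.items():
    try:
        dill.dumps(obj)
        print(f"{name}: serializable")
    except Exception as err:
        print(f"{name}: NOT serializable -> {err}")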
Top GitHub Comments
I tried the run_clm.py in master, but it still doesn't work. I will create a new issue. Thanks for your reply!

you rock!