[run_clm] tokenize_function clarification makes it non-hashable => cache not reused
Environment info
- transformers version: master at commit acc851e1ff92835d2a3ee9774d9d0abfda6e3f36 (from yesterday)
- Platform:
- Python version:
- PyTorch version (GPU?):
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help
@stas00 since you opened the PR #11145
Information
Model I am using (Bert, XLNet …):
The problem arises when using:
- [x] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)
To reproduce
I am running the minimal command:
CUDA_VISIBLE_DEVICES=0 python examples/language-modeling/run_clm.py \
--model_name_or_path gpt2 \
--dataset_name ./data/bk --block_size 1024 \
--do_train \
--output_dir debug --overwrite_output_dir \
--preprocessing_num_workers 5
When it gets to line 331, datasets.map gives this warning:
[WARNING|tokenization_utils_base.py:3143] 2021-04-09 15:48:53,408 >> Token indices sequence length is longer than the specified maximum sequence length for this model (191443 > 1024). Running this sequence through the model will result in indexing errors
[WARNING|run_clm.py:333] 2021-04-09 15:48:53,408 >> ^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model.
04/09/2021 15:48:53 - WARNING - 17900 - datasets.fingerprint - Parameter 'function'=<function tokenize_function at 0x7f747662c268> of the transform datasets.arrow_dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Basically, hashing the tokenize_function (to produce the cache file name) fails, so the pre-processed cache is not reused on the next launch.
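For context (my summary, not from the original report): datasets.map derives the cache file name from a fingerprint of the map function, computed by serializing it with dill; if serialization fails, a random fingerprint is used and the cached Arrow file can never be matched on the next run. A minimal, hedged sketch of checking whether a tokenize function hashes deterministically, assuming datasets.fingerprint.Hasher is available in the installed datasets version (the column name "text" is an illustrative placeholder):

# Hedged sketch, not from the issue: probe whether a map function can be hashed
# by the datasets fingerprinting machinery. Hasher is the helper datasets uses
# internally to derive cache fingerprints.
from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_function(examples):
    # same shape as the run_clm.py function, with a fixed column name
    return tokenizer(examples["text"])

try:
    # A stable value here across runs means the preprocessed cache can be reused.
    print(Hasher.hash(tokenize_function))
except Exception as err:
    # Serialization failure => datasets falls back to a random fingerprint.
    print(f"tokenize_function could not be hashed: {err}")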
The tokenize_function was originally:

def tokenize_function(examples):
    output = tokenizer(examples[text_column_name])
    return output
and became:

def tokenize_function(examples):
    tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base")
    with CaptureLogger(tok_logger) as cl:
        output = tokenizer(examples[text_column_name])
    # clm input could be much much longer than block_size
    if "Token indices sequence length is longer than the" in cl.out:
        tok_logger.warning(
            "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model."
        )
    return output
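Not part of the issue, just an illustrative guess at where the hashing might break: the new version references CaptureLogger and a logger object, and if any object reachable from the function fails to serialize with dill, the fingerprint computation fails. A hedged diagnostic sketch, assuming dill is installed (datasets depends on it for fingerprinting), that probes each suspect object in isolation:

# Hedged diagnostic sketch, not part of run_clm.py: test whether the objects
# referenced by the new tokenize_function can be serialized with dill, which is
# what the datasets fingerprinting relies on.
import dill
import transformers
from transformers.testing_utils import CaptureLogger

tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base")

for name, obj in {"tok_logger": tok_logger, "CaptureLogger": CaptureLogger}.items():
    try:
        dill.dumps(obj)
        print(f"{name}: serializable")
    except Exception as err:
        print(f"{name}: NOT serializable -> {err}")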
Top GitHub Comments
I tried the run_clm.py in master, but it still doesn't work. I will create a new issue. Thanks for your reply!

you rock!