Cached dataset not loaded

Describe the bug

I have a large dataset (common_voice, English) to which I apply several map and filter functions. Sometimes the cached result of a specific step is not loaded, even though I always use the same arguments and the same functions, with no seed…

Steps to reproduce the bug

def filter_by_duration(batch):
    return (
        batch["duration"] <= 10
        and batch["duration"] >= 1
        and len(batch["target_text"]) > 5
    )

def prepare_dataset(batch):
    batch["input_values"] = processor(
        batch["speech"], sampling_rate=batch["sampling_rate"][0]
    ).input_values
    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch

train_dataset = train_dataset.filter(
    filter_by_duration,
    remove_columns=["duration"],
    num_proc=data_args.preprocessing_num_workers,
)

# PROBLEM HERE -> below function is reexecuted and cache is not loaded
train_dataset = train_dataset.map(
    prepare_dataset,
    remove_columns=train_dataset.column_names,
    batch_size=training_args.per_device_train_batch_size,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
)

# Later in script
set_caching_enabled(False)
# apply map on trained model to eval/test sets

Expected results

The cached dataset should always be reloaded.

Actual results

The function is reexecuted.

I have access to the cached files cache-xxxxx.arrow. Is there a way to manually load two versions and see how the hash was computed, for debugging purposes (to know whether the issue is with the dataset or with the function)?

Environment info

datasets version: 1.6.2
Platform: Linux-5.8.0-45-generic-x86_64-with-glibc2.29
Python version: 3.8.5
PyTorch version (GPU?): 1.8.1+cu102 (True)
Tensorflow version (GPU?): not installed (NA)
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Issue Analytics

Created 2 years ago
Comments: 5 (5 by maintainers)

Top GitHub Comments

lhoestq commented, Jun 2, 2021

"does it consider just the name or the actual code of the function?"

It looks at the name, the actual code, and all the variables the function uses, recursively. It uses dill to do so, which is based on pickle. In short, the hash is computed from the pickle bytes of your function (produced with dill, so that most Python objects are supported).
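
To make the "name, code, and variables, recursively" idea concrete, here is a rough standard-library sketch. It is not how datasets computes its fingerprint (that goes through dill, as described above); the helper names function_fingerprint and _feed_code, and the MAX_DURATION variable, are hypothetical and exist only for this illustration. It hashes a function's name, its bytecode and constants, and the module-level values it reads, following referenced functions in turn.

```python
import hashlib
import pickle
import types


def _feed_code(code, hasher):
    """Feed bytecode and constants, descending into nested code objects."""
    hasher.update(code.co_code)
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            _feed_code(const, hasher)
        else:
            hasher.update(repr(const).encode())


def function_fingerprint(func, _seen=None):
    """Digest of a function's name, code, and the globals it references.

    Hypothetical helper: an approximation of the idea, not the dill-based
    hashing that datasets actually performs.
    """
    seen = set() if _seen is None else _seen
    if id(func) in seen:  # already visited: stops infinite recursion
        return "<cycle>"
    seen.add(id(func))

    hasher = hashlib.sha256()
    hasher.update(func.__name__.encode())
    _feed_code(func.__code__, hasher)
    for name in func.__code__.co_names:
        if name not in func.__globals__:
            continue  # builtin or attribute name, not a module-level variable
        value = func.__globals__[name]
        hasher.update(name.encode())
        if isinstance(value, types.FunctionType):
            # Referenced functions are fingerprinted the same way.
            hasher.update(function_fingerprint(value, seen).encode())
        else:
            try:
                hasher.update(pickle.dumps(value))
            except Exception:  # unpicklable object: fall back to its repr
                hasher.update(repr(value).encode())
    return hasher.hexdigest()


MAX_DURATION = 10  # hypothetical module-level variable read by the filter


def keep_short(example):
    return example["duration"] <= MAX_DURATION


# Same function, same state: the digest is stable within a run.
unchanged = function_fingerprint(keep_short) == function_fingerprint(keep_short)

before = function_fingerprint(keep_short)
MAX_DURATION = 20  # the function body is untouched, only a variable it reads
after = function_fingerprint(keep_short)
variable_changes_hash = before != after
```

The point of the sketch is that a function's hash can change even when its source is identical, because the state of any object it references (here MAX_DURATION, in the issue possibly processor) is part of what gets hashed. To inspect the real value, datasets exposes, as far as I know, a Hasher class in datasets.fingerprint, so comparing Hasher.hash(prepare_dataset) across two runs may show whether the function or the dataset is the part that changes; treat that as something to verify against your installed version.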