Cached dataset not loaded

Describe the bug

I have a large dataset (common_voice, English) to which I apply several map and filter functions. Sometimes the cached result of a specific function is not loaded, even though I always use the same arguments and the same functions, and no seed.

Steps to reproduce the bug

def filter_by_duration(batch):
    return (
        batch["duration"] <= 10
        and batch["duration"] >= 1
        and len(batch["target_text"]) > 5
    )

def prepare_dataset(batch):
    batch["input_values"] = processor(
        batch["speech"], sampling_rate=batch["sampling_rate"][0]
    ).input_values
    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch

train_dataset = train_dataset.filter(
    filter_by_duration,
    remove_columns=["duration"],
    num_proc=data_args.preprocessing_num_workers,
)

# PROBLEM HERE -> the map call below is re-executed and the cache is not loaded
train_dataset = train_dataset.map(
    prepare_dataset,
    remove_columns=train_dataset.column_names,
    batch_size=training_args.per_device_train_batch_size,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
)

# Later in script
set_caching_enabled(False)
# apply map on trained model to eval/test sets

Expected results

The cached dataset should always be reloaded.

Actual results

The function is re-executed.

I have access to the cached files (cache-xxxxx.arrow). Is there a way to manually load two versions and see how each hash was created, for debugging purposes (to know whether the issue comes from the dataset or from the function)?

Environment info

  • datasets version: 1.6.2
  • Platform: Linux-5.8.0-45-generic-x86_64-with-glibc2.29
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.8.1+cu102 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
lhoestq commented, Jun 2, 2021

Does it consider just the name or the actual code of the function?

It looks at the name, the actual code, and all the variables the function uses, recursively. It uses dill to do so, which is based on pickle. Basically, the hash is computed from the pickle bytes of your function (computed using dill to support most Python objects).

Does it consider variables that are not passed explicitly as parameters to the function (such as the processor here)?

Yes, it does, thanks to recursive pickling.
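
To make the point about implicit variables concrete, here is a rough, standard-library-only sketch. It is not how datasets computes its fingerprint (the library pickles the function with dill); `sketch_fingerprint`, `processor_config` and `prepare` are hypothetical names made up for this illustration. It only shows why a hash that covers a function's code plus the globals it references changes when one of those globals changes, even if the function itself is untouched.

```python
import hashlib
import pickle


def sketch_fingerprint(func):
    """Hash a function's name, bytecode, constants and referenced globals.

    Simplified stand-in for illustration only; it is NOT the datasets
    implementation, which relies on dill to pickle the whole function.
    """
    code = func.__code__
    hasher = hashlib.sha256()
    hasher.update(func.__name__.encode())
    hasher.update(code.co_code)
    hasher.update(repr(code.co_consts).encode())
    # Include every global the function body refers to by name, so that
    # objects that are not passed as parameters still affect the hash.
    for name in sorted(code.co_names):
        if name in func.__globals__:
            hasher.update(name.encode())
            hasher.update(pickle.dumps(func.__globals__[name]))
    return hasher.hexdigest()


# Hypothetical stand-in for the `processor` object used in the issue.
processor_config = {"sampling_rate": 16000}


def prepare(batch):
    # `processor_config` is read from the enclosing module, not passed in.
    batch["rate"] = processor_config["sampling_rate"]
    return batch


before = sketch_fingerprint(prepare)
processor_config["sampling_rate"] = 8000  # mutate state the function depends on
after = sketch_fingerprint(prepare)

print(before == after)  # prints False: same code, different referenced state
```

If something similar happens to an object referenced by the mapped function between two runs (for example, internal state that changes after the object is used), the computed hash would differ and the cached file would not be picked up. Whether that is what happened in this issue is not established by the thread.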

0 reactions
borisdayma commented, Jun 2, 2021

Thanks for these explanations. I’m closing the issue.
