Cached dataset not loaded
Describe the bug
I have a large dataset (Common Voice, English) to which I apply several map and filter functions. Sometimes the cached dataset produced by a specific function is not loaded, even though I always use the same arguments and the same functions, with no seed…
Steps to reproduce the bug
def filter_by_duration(batch):
    return (
        batch["duration"] <= 10
        and batch["duration"] >= 1
        and len(batch["target_text"]) > 5
    )

def prepare_dataset(batch):
    batch["input_values"] = processor(
        batch["speech"], sampling_rate=batch["sampling_rate"][0]
    ).input_values
    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch

train_dataset = train_dataset.filter(
    filter_by_duration,
    remove_columns=["duration"],
    num_proc=data_args.preprocessing_num_workers,
)

# PROBLEM HERE -> the map below is re-executed and the cache is not loaded
train_dataset = train_dataset.map(
    prepare_dataset,
    remove_columns=train_dataset.column_names,
    batch_size=training_args.per_device_train_batch_size,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
)

# Later in the script
set_caching_enabled(False)
# apply map on the trained model for the eval/test sets
Expected results
The cached dataset should always be reloaded.
Actual results
The function is re-executed instead of being loaded from the cache.
I have access to the cached files (cache-xxxxx.arrow).
Is there a way to manually load the two versions and see how the hash was created, for debugging purposes (to know whether the issue comes from the dataset or from the function)?
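Something like the sketch below is what I have in mind (the cache file path is a placeholder, and Hasher is an internal helper from datasets.fingerprint, so treat this as an illustration rather than an official recipe): load one of the cached arrow files directly and compare the hash computed for the processing functions defined above across runs.

    from datasets import Dataset
    from datasets.fingerprint import Hasher  # internal helper, may change between versions

    # Placeholder path: substitute one of the real cache-xxxxx.arrow files
    # from the dataset's cache directory.
    cached = Dataset.from_file("path/to/cache-xxxxx.arrow")
    print(cached)

    # The cache lookup is based on a fingerprint that includes the hash of the
    # processing function, so printing these in two different runs shows
    # whether the functions themselves hash differently.
    print(Hasher.hash(filter_by_duration))
    print(Hasher.hash(prepare_dataset))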
Environment info
- datasets version: 1.6.2
- Platform: Linux-5.8.0-45-generic-x86_64-with-glibc2.29
- Python version: 3.8.5
- PyTorch version (GPU?): 1.8.1+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
It looks at the name and the actual code of the function, as well as all the variables it uses, recursively. It uses dill to do so, which is based on pickle. Basically, the hash is computed from the pickle bytes of your function (dill is used so that most Python objects can be pickled).

Yes it does, thanks to recursive pickling.
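To make the recursive pickling concrete, here is a minimal sketch using dill directly (datasets wraps dill in its own Hasher, so the exact bytes differ, but the principle is the same; THRESHOLD and filter_example are made up for illustration): if a global that the function references changes, the pickled bytes, and therefore the hash, change as well.

    import dill

    THRESHOLD = 10  # a global the function depends on

    def filter_example(batch):
        return batch["duration"] <= THRESHOLD

    # recurse=True tells dill to also pickle the objects referenced by the
    # function, not just its code, which mirrors the behaviour described above.
    bytes_before = dill.dumps(filter_example, recurse=True)

    THRESHOLD = 12  # change the referenced global, not the function body
    bytes_after = dill.dumps(filter_example, recurse=True)

    # The pickled payloads differ, so any hash computed from them differs too.
    print(bytes_before == bytes_after)  # False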
Thanks for these explanations. I’m closing the issue.