Caching processed dataset at wrong folder
Hi guys, I ran this on my Colab (PRO):
from datasets import load_dataset

dataset = load_dataset('text', data_files='/content/corpus.txt', cache_dir='/content/drive/My Drive', split='train')

# tokenizer is assumed to be defined earlier (not shown in the report)
def encode(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length')

dataset = dataset.map(encode, batched=True)
The file is about 4 GB, so I cannot process it on the Colab HD because there is not enough space. So I decided to mount my Google Drive fs and do the processing there.
The dataset is cached in the right place, but processing it (applying the encode function) seems to use a different folder: the Colab HD starts to fill up and the run crashes, when the work should be happening on the Drive fs.
What drives me crazy is that it prints that it is processing/encoding the dataset in the right folder:
Testing the mapped function outputs
Testing finished, running the mapping function on the dataset
Caching processed dataset at /content/drive/My Drive/text/default-ad3e69d6242ee916/0.0.0/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/cache-b16341780a59747d.arrow
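One way to confirm which filesystem is actually filling up is to compare free space before and after the map call. This is only a minimal diagnostic sketch using the standard library; the helper name free_gib is hypothetical, and the paths you pass to it are whatever mount points apply to your session.

```python
import shutil


def free_gib(path: str) -> float:
    """Return the free space, in GiB, of the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


# Hypothetical usage: call this for both the local disk and the mounted
# Drive folder before and after dataset.map(...), then compare the numbers.
print(f"free space here: {free_gib('.'):.2f} GiB")
```

If the free space of the local disk drops while the Drive mount stays flat, the intermediate data is being written locally even though the log reports the Drive path.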
It looks like replacing pa.OSFile with open fixes it. I’m going to open a PR.

Thanks for reporting! It uses a temporary file to write the data. However, it looks like the temporary file is not placed in the right directory during processing.
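The general idea of keeping the temporary file in the same directory as the final cache file can be sketched as follows. This is not the library's actual code or the fix from the PR; it is a standalone illustration with the standard library, and the function name write_via_temp_file is hypothetical.

```python
import os
import tempfile


def write_via_temp_file(final_path: str, payload: bytes) -> str:
    """Write payload to a temp file next to final_path, then move it into place.

    Returns the path the temporary file had before it was renamed.
    """
    target_dir = os.path.dirname(os.path.abspath(final_path))
    os.makedirs(target_dir, exist_ok=True)
    # Passing dir=target_dir keeps the temporary file on the same filesystem
    # as the destination instead of the system default temp directory.
    with tempfile.NamedTemporaryFile("wb", dir=target_dir, delete=False) as tmp:
        tmp.write(payload)
        tmp_path = tmp.name
    # Rename within the same directory; no data is copied across filesystems.
    os.replace(tmp_path, final_path)
    return tmp_path
```

Without an explicit dir argument, tempfile falls back to the system default temporary directory, which on a setup like the one above would be on the local disk rather than the mounted Drive.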