Caching processed dataset at wrong folder

Hi guys, I ran this on Colab (Pro):

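from datasets import load_dataset

# Load the text file, caching the prepared dataset on the mounted Google Drive
dataset = load_dataset('text', data_files='/content/corpus.txt', cache_dir='/content/drive/My Drive', split='train')

# `tokenizer` is not defined in the post; it is assumed to have been created earlier
def encode(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length')

# Tokenize in batches; this is the step during which the Colab disk fills up
dataset = dataset.map(encode, batched=True)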

The file is about 4 GB, so I cannot process it on the Colab disk because there is not enough space. I therefore mounted my Google Drive filesystem to do the work there. The dataset is cached in the right place, but processing it (applying the encode function) seems to use a different folder: the Colab disk starts to fill up and the session crashes, even though everything should be happening on the Drive filesystem.
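One way to confirm which filesystem is actually growing is to sample disk usage before and during the `map` call. The sketch below uses only the standard library; the helper name `disk_report` is made up for illustration, and `"."` stands in for the paths from the post (on Colab you would pass something like `/content` and `/content/drive/My Drive`).

```python
import shutil

def disk_report(paths):
    """Return {path: (used_gib, free_gib)} for the filesystem holding each path."""
    gib = 1024 ** 3
    report = {}
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
        report[path] = (usage.used / gib, usage.free / gib)
    return report

# "." is a placeholder; replace it with the local and Drive paths to compare.
for path, (used, free) in disk_report(["."]).items():
    print(f"{path}: used {used:.1f} GiB, free {free:.1f} GiB")
```

If the "used" figure for the local disk climbs while the mapping runs, the intermediate data is being written there rather than on the Drive mount.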

What drives me crazy is that it reports it is processing/encoding the dataset in the right folder:

Testing the mapped function outputs
Testing finished, running the mapping function on the dataset
Caching processed dataset at /content/drive/My Drive/text/default-ad3e69d6242ee916/0.0.0/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/cache-b16341780a59747d.arrow
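As a quick sanity check, the path in that log line can be compared against the intended `cache_dir`. This is a minimal sketch with hypothetical helper names (`cache_path_from_log`, `is_inside`); it only inspects the final cache path the log reports, so it says nothing about where any intermediate files are written.

```python
from pathlib import PurePosixPath

LOG_PREFIX = "Caching processed dataset at "

def cache_path_from_log(line):
    """Extract the path from a 'Caching processed dataset at ...' line, or None."""
    line = line.strip()
    if not line.startswith(LOG_PREFIX):
        return None
    return PurePosixPath(line[len(LOG_PREFIX):])

def is_inside(path, root):
    """True if `path` is `root` itself or lies somewhere below it."""
    root = PurePosixPath(root)
    return path == root or root in path.parents

# The log line from the post, split across string literals for readability.
log_line = (
    "Caching processed dataset at /content/drive/My Drive/text/"
    "default-ad3e69d6242ee916/0.0.0/"
    "7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/"
    "cache-b16341780a59747d.arrow"
)
cache_path = cache_path_from_log(log_line)
print(is_inside(cache_path, "/content/drive/My Drive"))  # True
```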

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

lhoestq commented, Sep 22, 2020 (1 reaction)

It looks like replacing pa.OSFile with open fixes it. I’m going to open a PR.

lhoestq commented, Sep 19, 2020 (1 reaction)

Thanks for reporting! It uses a temporary file to write the data. However, it looks like the temporary file is not placed in the right directory during processing.
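To illustrate the general pattern being described (this is not the library's actual implementation), here is a minimal standard-library sketch of writing through a temporary file that is created next to the final target instead of in the system default temp directory. The function name `write_via_temp_file` is an assumption made for the example.

```python
import os
import tempfile

def write_via_temp_file(final_path, chunks):
    """Write byte chunks to a temp file beside final_path, then rename it into place."""
    target_dir = os.path.dirname(os.path.abspath(final_path))
    # dir=target_dir keeps the temporary file in the target's directory;
    # without it, the file goes to the default from tempfile.gettempdir().
    tmp = tempfile.NamedTemporaryFile("wb", dir=target_dir, suffix=".tmp", delete=False)
    try:
        with tmp:  # closes the file on exit; delete=False leaves it on disk
            for chunk in chunks:
                tmp.write(chunk)
        os.replace(tmp.name, final_path)  # rename within the same directory
    except BaseException:
        # Clean up the partial temp file if writing or renaming failed.
        if os.path.exists(tmp.name):
            os.remove(tmp.name)
        raise
    return final_path
```

Calling `write_via_temp_file("/some/dir/cache-demo.arrow", [b"..."])` (a hypothetical path) would only ever consume space under `/some/dir`, which is the behavior the reporter expected for the Drive-backed cache folder.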
