datasets 1.6 ignores cache

See original GitHub issue

Moving from https://github.com/huggingface/transformers/issues/11801#issuecomment-845546612

Quoting @VictorSanh:

I downgraded datasets to 1.5.0 and printed tokenized_datasets.cache_files (L335):

{'train': [{'filename': '/home/victor/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b/cache-c6aefe81ca4e5152.arrow'}], 'validation': [{'filename': '/home/victor/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b/cache-97cf4c813e6469c6.arrow'}]}

while the same command with the latest version of datasets (actually starting at 1.6.0) gives:

{'train': [], 'validation': []}

I also confirm that downgrading to datasets==1.5.0 makes things fast again - i.e. cache is used.

To reproduce:

USE_TF=0 python  examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name "stas/openwebtext-10k" \
    --output_dir output_dir \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --max_train_samples 1000 \
    --max_eval_samples 200 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --num_train_epochs 1 \
    --warmup_steps 8 \
    --block_size 64 \
    --fp16 \
    --report_to none

The first time, startup is slow and some 5 tqdm progress bars appear. It shouldn’t do that on subsequent runs, but with datasets>1.5.0 it rebuilds the cache on every run.

@lhoestq

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
albertvillanova commented, May 24, 2021

Great! Thanks, @stas00.

I am implementing your suggestion to turn the feature off when the value is set to 0.
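The "set it to 0 to turn it off" convention could be sketched roughly as follows. This is only an illustration, not the library's actual implementation: the environment variable name mirrors the one discussed in this thread, but the default value and the helper names are made up.

```python
import os

# Hypothetical default threshold (250 MB); the real default may differ.
DEFAULT_MAX_IN_MEMORY = 250 * 10**6


def max_in_memory_size() -> int:
    """Read the threshold from the environment, falling back to the default."""
    raw = os.environ.get("MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES")
    if raw is None:
        return DEFAULT_MAX_IN_MEMORY
    # Accept plain integers ("250000000") as well as scientific
    # notation ("1e12"), and of course "0".
    return int(float(raw))


def keep_in_memory(dataset_size: int) -> bool:
    """Decide whether a dataset of `dataset_size` bytes is kept in memory.

    A limit of 0 disables in-memory loading entirely, so the on-disk
    cache is always used.
    """
    limit = max_in_memory_size()
    return limit > 0 and dataset_size <= limit
```

With this convention, `MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES=0` means every dataset goes through the cache, while an unset variable keeps the default behavior.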

For the other suggestion (allowing different metric prefixes), I will discuss with @lhoestq to agree on its implementation.

1 reaction
stas00 commented, May 24, 2021

That’s a good question, and again the normal bytes can be used for that:

MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES=1e12 # (~2**40)

Since it’s unlikely that anybody will have more than 1TB RAM.

It’s also silly that it uses BYTES and not MBYTES - that level of refinement doesn’t seem to be of a practical use in this context.

Not sure when it was added and if there are back-compat issues here, but perhaps it could be renamed MAX_IN_MEMORY_DATASET_SIZE and support 1M, 1G, 1T, etc.

But scientific notation is quite intuitive too, as each 000 zeros is the next M, G, T multiplier. Minus the discrepancy of 1024 vs 1000, which adds up. And it is easy to write down 1e12, as compared to 1099511627776 (2**40). (1.1e12 is more exact).
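A setting that accepts unit suffixes alongside plain or scientific-notation byte counts could be parsed along these lines. This is a sketch of the idea, not the library's API: `parse_size` is a hypothetical helper, and the choice of decimal (1000-based) rather than binary (1024-based) multipliers is an assumption.

```python
def parse_size(text: str) -> int:
    """Parse a size like "1M", "1G", "1T", "250000000", or "1e12" into bytes.

    Suffixes use decimal (SI) multipliers, so "1T" == 10**12, which sidesteps
    the 1000-vs-1024 discrepancy mentioned above by picking one convention.
    """
    multipliers = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
    text = text.strip().upper()
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    # No suffix: plain bytes or scientific notation, e.g. "1e12".
    return int(float(text))
```

Under this scheme `parse_size("1T")` and `parse_size("1e12")` give the same value, so both spellings from the discussion above would work.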
