datasets 1.6 ignores cache
Moving from https://github.com/huggingface/transformers/issues/11801#issuecomment-845546612
Quoting @VictorSanh:
I downgraded datasets to 1.5.0 and printed tokenized_datasets.cache_files (L335):
{'train': [{'filename': '/home/victor/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b/cache-c6aefe81ca4e5152.arrow'}], 'validation': [{'filename': '/home/victor/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b/cache-97cf4c813e6469c6.arrow'}]}
while the same command with the latest version of datasets (actually starting at 1.6.0) gives:
{'train': [], 'validation': []}
I also confirm that downgrading to datasets==1.5.0 makes things fast again - i.e. the cache is used.
To reproduce:
USE_TF=0 python examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 \
--dataset_name "stas/openwebtext-10k" \
--output_dir output_dir \
--overwrite_output_dir \
--do_train \
--do_eval \
--max_train_samples 1000 \
--max_eval_samples 200 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--num_train_epochs 1 \
--warmup_steps 8 \
--block_size 64 \
--fp16 \
--report_to none
The first time, the startup is slow and shows some 5 tqdm bars; it shouldn’t do that on subsequent runs. But with datasets>1.5.0 it rebuilds on every run.
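For anyone who wants to check this outside of run_clm.py, here is a minimal sketch of the same verification; the tokenizer and the 'text' column are assumptions for illustration, not the exact preprocessing the script does:

from datasets import load_dataset
from transformers import GPT2TokenizerFast

raw = load_dataset("stas/openwebtext-10k")
tok = GPT2TokenizerFast.from_pretrained("gpt2")
tokenized = raw.map(lambda ex: tok(ex["text"]), batched=True, remove_columns=["text"])
# With datasets==1.5.0 this prints the .arrow cache paths quoted above;
# with 1.6.0 it came back as empty lists, i.e. the cache was not reused.
print(tokenized.cache_files)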
Great! Thanks, @stas00.
I am implementing your suggestion to turn the default off when the value is set to 0.
For the other suggestion (allowing different metric prefixes), I will discuss with @lhoestq to agree on its implementation.
That’s a good question, and again the normal bytes value can be used for that, since it’s unlikely that anybody will have more than 1TB of RAM.
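For illustration, a minimal sketch of what that could look like; the environment variable name below is an assumption based on the config being discussed here and should be checked against the installed datasets version:

import os

# Assumed variable name - set it before importing datasets so the config picks it up.
os.environ["HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"] = str(int(1e12))  # ~1TB, effectively "always use the cache"
# os.environ["HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"] = "0"           # the proposed "turn it off" value

import datasets  # imported only after the variable is set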
It’s also silly that it uses BYTES and not MBYTES - that level of refinement doesn’t seem to be of practical use in this context.
Not sure when it was added and if there are back-compat issues here, but perhaps it could be renamed MAX_IN_MEMORY_DATASET_SIZE and support 1M, 1G, 1T, etc. But scientific notation is quite intuitive too, as each group of three zeros is the next M, G, T multiplier - minus the discrepancy of 1024 vs 1000, which adds up. And it is easy to write down 1e12, as compared to 1099511627776 (2**40); 1.1e12 is more exact.
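A quick sanity check of those magnitudes in plain Python:

print(int(1e12))    # 1000000000000 - the decimal 1T
print(2**40)        # 1099511627776 - the binary 1Ti
print(int(1.1e12))  # 1100000000000 - close to 2**40, as noted above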