datasets 1.6 ignores cache
Moving from https://github.com/huggingface/transformers/issues/11801#issuecomment-845546612
Quoting @VictorSanh:
I downgraded datasets to 1.5.0 and printed tokenized_datasets.cache_files (L335):
{'train': [{'filename': '/home/victor/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b/cache-c6aefe81ca4e5152.arrow'}], 'validation': [{'filename': '/home/victor/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b/cache-97cf4c813e6469c6.arrow'}]}
while the same command with the latest version of datasets (actually starting at 1.6.0) gives:
{'train': [], 'validation': []}
I also confirm that downgrading to datasets==1.5.0 makes things fast again - i.e. the cache is used.
To reproduce:
USE_TF=0 python examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 \
--dataset_name "stas/openwebtext-10k" \
--output_dir output_dir \
--overwrite_output_dir \
--do_train \
--do_eval \
--max_train_samples 1000 \
--max_eval_samples 200 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--num_train_epochs 1 \
--warmup_steps 8 \
--block_size 64 \
--fp16 \
--report_to none
The first time, the startup is slow and shows some 5 tqdm bars; it shouldn’t do that on subsequent runs. But with datasets>1.5.0 it rebuilds on every run.
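For anyone who wants to check this outside of run_clm.py, here is a minimal sketch of the same verification; the tokenizer and the 'text' column are assumptions for illustration, not the exact preprocessing the script does:

from datasets import load_dataset
from transformers import GPT2TokenizerFast

raw = load_dataset("stas/openwebtext-10k")
tok = GPT2TokenizerFast.from_pretrained("gpt2")
tokenized = raw.map(lambda ex: tok(ex["text"]), batched=True, remove_columns=["text"])
# With datasets==1.5.0 this prints the .arrow cache paths quoted above;
# with 1.6.0 it came back as empty lists, i.e. the cache was not reused.
print(tokenized.cache_files)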
Great! Thanks, @stas00.
I am implementing your suggestion to turn the default off when the value is set to 0.
For the other suggestion (allowing different metric prefixes), I will discuss with @lhoestq to agree on its implementation.
That’s a good question, and again the normal bytes value can be used for that, since it’s unlikely that anybody will have more than 1TB of RAM.
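For illustration, a minimal sketch of what that could look like; the environment variable name below is an assumption based on the config being discussed here and should be checked against the installed datasets version:

import os

# Assumed variable name - set it before importing datasets so the config picks it up.
os.environ["HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"] = str(int(1e12))  # ~1TB, effectively "always use the cache"
# os.environ["HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"] = "0"           # the proposed "turn it off" value

import datasets  # imported only after the variable is set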
It’s also silly that it uses BYTES and not MBYTES - that level of refinement doesn’t seem to be of practical use in this context.
Not sure when it was added and if there are back-compat issues here, but perhaps it could be renamed MAX_IN_MEMORY_DATASET_SIZE and support 1M, 1G, 1T, etc. But scientific notation is quite intuitive too, as each group of three zeros is the next M, G, T multiplier - minus the discrepancy of 1024 vs 1000, which adds up. And it is easy to write down 1e12, as compared to 1099511627776 (2**40); 1.1e12 is more exact.
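A quick sanity check of those magnitudes in plain Python:

print(int(1e12))    # 1000000000000 - the decimal 1T
print(2**40)        # 1099511627776 - the binary 1Ti
print(int(1.1e12))  # 1100000000000 - close to 2**40, as noted above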