[examples] run_clm re-processes dataset on every run
See the original GitHub issue. Developing with run_clm is difficult since its startup is very slow: it rebuilds the dataset on each start.
@VictorSanh says it started to do that recently…
I think it’s because it has to chunk the existing dataset into smaller pieces; this slow start happens every time, and the results aren’t saved. So the original dataset has already been preprocessed, but that isn’t good enough for run_clm.py.
So I’m thinking perhaps for dev needs we need a dataset with short (<512-token) entries, which run_clm could then use without additional preprocessing? But I could be wrong; I haven’t investigated the reason for the slow start.
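For context, the chunking step in question is run_clm's `group_texts` map, which concatenates tokenized examples and re-splits them into fixed `block_size` chunks; this is the work that reruns at every startup when its result isn't reused from a cache. A simplified, hypothetical sketch of that logic (names and details approximate, not the script's exact code):

```python
def group_texts(examples, block_size=64):
    # Concatenate all token lists, then split into block_size chunks,
    # dropping the trailing remainder (as the run_clm example does).
    concatenated = [tok for ex in examples["input_ids"] for tok in ex]
    total_length = (len(concatenated) // block_size) * block_size
    chunks = [concatenated[i:i + block_size]
              for i in range(0, total_length, block_size)]
    # For causal LM, labels are a copy of the inputs.
    return {"input_ids": chunks, "labels": [c[:] for c in chunks]}

# Two tokenized examples of 100 and 80 tokens -> 180 tokens total,
# which yields two full blocks of 64 (the last 52 tokens are dropped).
batch = {"input_ids": [list(range(100)), list(range(100, 180))]}
out = group_texts(batch, block_size=64)
print(len(out["input_ids"]), len(out["input_ids"][0]))  # 2 64
```

Since this is O(dataset size) on every run, skipping it via caching (or via a pre-chunked dev dataset) is what would make startup fast.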
To reproduce:
USE_TF=0 python examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 \
--dataset_name "stas/openwebtext-10k" \
--output_dir output_dir \
--overwrite_output_dir \
--do_train \
--do_eval \
--max_train_samples 1000 \
--max_eval_samples 200 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--num_train_epochs 1 \
--warmup_steps 8 \
--block_size 64 \
--fp16 \
--report_to none
Watch the tqdm progress bars before training starts to see the symptom. And this is already a very truncated dataset.
Issue Analytics
- Created: 2 years ago
- Comments: 6 (6 by maintainers)
Top GitHub Comments
OK, moved this to datasets: https://github.com/huggingface/datasets/issues/2387

The dataset caching all relies on the datasets library, so the issue should probably be tracked there. Especially if this is a new change: since there was no change I’m aware of in run_clm recently, it may be coming from a change there.
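The caching the comment refers to is the datasets library's fingerprint-based reuse: a `.map` result is reused only when the mapping function and its arguments match a previous run. As a rough toy analogue of that idea (not the library's actual implementation, just an illustration of why a changed function or argument forces a full re-map):

```python
import hashlib
import pickle

# Toy fingerprint cache: keyed on the function's bytecode plus its
# inputs, so any change to either forces recomputation.
_cache = {}

def cached_map(fn, data, **kwargs):
    key = hashlib.sha256(
        fn.__code__.co_code + pickle.dumps((data, sorted(kwargs.items())))
    ).hexdigest()
    if key not in _cache:  # cache miss: recompute the whole map
        _cache[key] = [fn(x, **kwargs) for x in data]
    return _cache[key]

def chunk(x, block_size=4):
    return x[:block_size]

first = cached_map(chunk, ["abcdefgh"], block_size=4)
second = cached_map(chunk, ["abcdefgh"], block_size=4)
print(first is second)  # True: the second call is a cache hit
```

Under this model, if a library update changes the preprocessing function (or its default arguments), the fingerprint changes and every run re-chunks from scratch, which would match the symptom reported above.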