[examples] run_clm re-processes dataset on every run
See the original GitHub issue. Developing with run_clm is difficult since its startup is very slow: it rebuilds the dataset on each start.
@VictorSanh says it started to do that recently…
I think it’s because it has to chunk the existing dataset into smaller pieces; this slow start happens every time, and the results aren’t saved. So the original dataset has already been preprocessed, but that isn’t good enough for run_clm.py.
So I’m thinking perhaps for dev needs we need a dataset with short (<512-token) entries, which run_clm could then use without additional preprocessing? But I could be wrong; I haven’t investigated the reason for the slow start.
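For context, the chunking step in question is run_clm's `group_texts` map, which concatenates tokenized examples and re-splits them into fixed `block_size` chunks; this is the work that reruns at every startup when its result isn't reused from a cache. A simplified, hypothetical sketch of that logic (names and details approximate, not the script's exact code):

```python
def group_texts(examples, block_size=64):
    # Concatenate all token lists, then split into block_size chunks,
    # dropping the trailing remainder (as the run_clm example does).
    concatenated = [tok for ex in examples["input_ids"] for tok in ex]
    total_length = (len(concatenated) // block_size) * block_size
    chunks = [concatenated[i:i + block_size]
              for i in range(0, total_length, block_size)]
    # For causal LM, labels are a copy of the inputs.
    return {"input_ids": chunks, "labels": [c[:] for c in chunks]}

# Two tokenized examples of 100 and 80 tokens -> 180 tokens total,
# which yields two full blocks of 64 (the last 52 tokens are dropped).
batch = {"input_ids": [list(range(100)), list(range(100, 180))]}
out = group_texts(batch, block_size=64)
print(len(out["input_ids"]), len(out["input_ids"][0]))  # 2 64
```

Since this is O(dataset size) on every run, skipping it via caching (or via a pre-chunked dev dataset) is what would make startup fast.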
To reproduce:
USE_TF=0 python examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 \
--dataset_name "stas/openwebtext-10k" \
--output_dir output_dir \
--overwrite_output_dir \
--do_train \
--do_eval \
--max_train_samples 1000 \
--max_eval_samples 200 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--num_train_epochs 1 \
--warmup_steps 8 \
--block_size 64 \
--fp16 \
--report_to none
Watch the tqdm progress bars before training starts to see the symptom. And this is already a very truncated dataset.
Issue Analytics
- Created: 2 years ago
- Comments: 6 (6 by maintainers)
Top GitHub Comments
OK, moved this to datasets: https://github.com/huggingface/datasets/issues/2387

The dataset caching all relies on the datasets library, so the issue should probably be tracked there. Especially if this is a new change: since there was no change I’m aware of in run_clm recently, it may be coming from a change there.
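The caching the comment refers to is the datasets library's fingerprint-based reuse: a `.map` result is reused only when the mapping function and its arguments match a previous run. As a rough toy analogue of that idea (not the library's actual implementation, just an illustration of why a changed function or argument forces a full re-map):

```python
import hashlib
import pickle

# Toy fingerprint cache: keyed on the function's bytecode plus its
# inputs, so any change to either forces recomputation.
_cache = {}

def cached_map(fn, data, **kwargs):
    key = hashlib.sha256(
        fn.__code__.co_code + pickle.dumps((data, sorted(kwargs.items())))
    ).hexdigest()
    if key not in _cache:  # cache miss: recompute the whole map
        _cache[key] = [fn(x, **kwargs) for x in data]
    return _cache[key]

def chunk(x, block_size=4):
    return x[:block_size]

first = cached_map(chunk, ["abcdefgh"], block_size=4)
second = cached_map(chunk, ["abcdefgh"], block_size=4)
print(first is second)  # True: the second call is a cache hit
```

Under this model, if a library update changes the preprocessing function (or its default arguments), the fingerprint changes and every run re-chunks from scratch, which would match the symptom reported above.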