
[examples] run_clm re-processes dataset on every run


Developing with run_clm is difficult since its startup is very slow: it rebuilds the dataset on each start.

@VictorSanh says it started to do that recently…

I think it’s because it has to chunk the existing dataset into smaller pieces; that makes for a slow start every time, and it doesn’t save these results. So the original dataset has already been preprocessed, but that preprocessing isn’t sufficient for run_clm.py.

So I’m thinking that perhaps for dev needs we need a dataset with short (<512) entries, which the script could then use without additional preprocessing?

But I could be wrong; I haven’t investigated the reason for the slow start.
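To make the suspected bottleneck concrete, here is a minimal sketch of the kind of regrouping run_clm performs after tokenization: all token lists are concatenated and re-split into fixed `block_size` chunks, dropping the remainder. This is an illustration of the technique, not the script’s actual code; the function name and toy data are mine.

```python
def group_texts(token_lists, block_size):
    """Concatenate token lists and re-split them into block_size chunks.

    This mirrors the chunking step that has to run over the whole
    dataset before training can begin.
    """
    concatenated = [tok for tokens in token_lists for tok in tokens]
    # Drop the tail that doesn't fill a whole block.
    total_length = (len(concatenated) // block_size) * block_size
    return [
        concatenated[i : i + block_size]
        for i in range(0, total_length, block_size)
    ]

# Toy example with pretend token ids:
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(group_texts(docs, block_size=4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Since this pass touches every example, a run is only fast on startup if its output is cached and reused.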

To reproduce:

USE_TF=0 python examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name "stas/openwebtext-10k" \
    --output_dir output_dir \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --max_train_samples 1000 \
    --max_eval_samples 200 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --num_train_epochs 1 \
    --warmup_steps 8 \
    --block_size 64 \
    --fp16 \
    --report_to none

So look at the tqdm bars before training starts to see the symptom. And this is already a very truncated dataset.

@VictorSanh, @sgugger

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

stas00 commented, May 21, 2021 (1 reaction)
sgugger commented, May 20, 2021 (1 reaction)

The dataset caching all relies on the datasets library, so the issue should probably be tracked there. Especially if this is a new behavior: since there was no change I’m aware of in run_clm recently, it may be coming from a change in datasets.
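The caching sgugger refers to is fingerprint-based: the datasets library hashes the input data together with the processing function, and an identical second run loads the cached result from disk instead of recomputing. A change that perturbs the fingerprint silently turns every run into a cache miss. Here is a toy, pure-stdlib illustration of that scheme (not the library’s actual implementation; all names are mine):

```python
import hashlib
import os
import pickle
import tempfile

def cached_map(records, fn, cache_dir):
    """Apply fn to records, caching the result under a fingerprint.

    The fingerprint hashes both the input data and the function's
    bytecode, so the same (data, function) pair hits the cache.
    Returns (result, was_cache_hit).
    """
    fingerprint = hashlib.sha256(
        pickle.dumps(records) + fn.__code__.co_code
    ).hexdigest()
    path = os.path.join(cache_dir, fingerprint + ".pkl")
    if os.path.exists(path):  # cache hit: skip the expensive pass
        with open(path, "rb") as f:
            return pickle.load(f), True
    result = [fn(r) for r in records]  # cache miss: compute and store
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result, False

with tempfile.TemporaryDirectory() as d:
    data = ["a", "bb", "ccc"]
    out1, hit1 = cached_map(data, lambda r: len(r), d)
    out2, hit2 = cached_map(data, lambda r: len(r), d)
    print(out1, hit1, hit2)  # [1, 2, 3] False True
```

If run_clm rebuilds the dataset on every start, the question is which part of this pipeline stopped producing a stable fingerprint.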


