Running out of memory when resuming training
See original GitHub issue
Might be a similar problem to #11317: the node runs out of CPU memory (512 GB).
To reproduce:
(i)
deepspeed --hostfile myhostfile \
${_PATH}/examples/pytorch/summarization/run_summarization.py \
--model_name_or_path hyunwoongko/blenderbot-9B \
--do_train \
--do_eval \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--deepspeed ${_PATH}/tests/deepspeed/ds_config_zero3.json \
--logging_steps 1 \
--fp16 \
--overwrite_output_dir \
--save_steps 10 \
--gradient_accumulation_steps 1 \
--evaluation_strategy="steps" \
--max_train_samples 10024 \
--max_eval_samples 32 \
--max_source_length 128 \
--max_target_length 128 \
--eval_steps 5
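The reproduction points at the ZeRO-3 test config shipped with Transformers (tests/deepspeed/ds_config_zero3.json). As a hedged sketch only, and not the exact contents of that file, a minimal ZeRO stage-3 configuration in the same spirit looks roughly like the Python dict below; the "auto" values are placeholders that the Transformers DeepSpeed integration fills in from the TrainingArguments:

# Hedged sketch: NOT the exact contents of tests/deepspeed/ds_config_zero3.json,
# just a minimal ZeRO stage-3 configuration in the same spirit.
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}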
(ii)
Afterwards, in order to resume, I rerun the same command with the option --resume_from_checkpoint /tmp/tst-summarization/checkpoint-10.
A workaround is to export the FP32 weights with the zero_to_fp32.py script, as described in https://huggingface.co/transformers/master/main_classes/deepspeed.html#getting-the-model-weights-out, and restart directly from the resulting pytorch_model.bin. Nevertheless, it would be better to resume directly from the DeepSpeed checkpoint, if possible.
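As a rough illustration of that workaround, assuming a DeepSpeed version that exposes the zero_to_fp32 helpers as importable utilities (older versions only provide the standalone zero_to_fp32.py script inside the checkpoint folder), the consolidation step could look like this:

# Sketch only: consolidate the sharded ZeRO-3 checkpoint into a single fp32
# state dict and save it as a regular pytorch_model.bin to restart from.
# Assumes deepspeed.utils.zero_to_fp32 exposes this helper in your version.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "/tmp/tst-summarization/checkpoint-10"  # path from the repro above
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # runs on CPU
torch.save(state_dict, "/tmp/tst-summarization/pytorch_model.bin")

Note that the consolidation itself happens on CPU, so it still needs enough host memory for roughly one full fp32 copy of the model.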
torch: 1.8.1+cu111
transformers: 4.9.0.dev0
deepspeed: 0.4.4+d1a7a55
log: log.txt
Issue Analytics
- Created 2 years ago
- Comments: 9 (8 by maintainers)
Top Results From Across the Web
- Run_mlm.py cuda error memory after resuming a training: "For some reason, running training from scratch with the Seq2SeqTrainer works just fine, but resuming from checkpoint exceeds the memory limit, ..."
- Out of memory error when resume training even though my ...: "I am training a classification model and I have saved some checkpoints. When I try to resume training, however, I got out of..."
- Resuming pytorch model training raises error "CUDA out of ...": "I was facing exactly the same issue, even my model was running out of memory on 80 GB A100, while starting from a..."
- Train a model — MMSegmentation 0.29.1 documentation: "To trade speed with GPU memory, you may pass in --cfg-options ... Resume from a previous checkpoint file (to continue the training process)..."
- Stopping and Resuming a Tune Run - the Ray documentation: "If you've stopped a run and want to resume from where you left off, ... Status == Memory usage on this node: ..."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Yes, that did the trick! It’s the same memory usage now. Applied here: https://github.com/huggingface/transformers/pull/12718
The main issue is loading the optimizer states, which are 2x bigger than the fp32 model.
Actually, I thought of a possible solution last night: staggered checkpoint loading.
Right now, if you have 4 GPUs on a node, the whole checkpoint folder gets loaded into CPU memory at once. But what if we loaded one GPU at a time? That would require only 1/4 of the extra CPU memory, since as each GPU finishes loading it returns its CPU memory to the pool.
I think this approach should solve your limitation. Let me try to implement it on the DeepSpeed side.
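For a sense of scale: blenderbot-9B has roughly 9B parameters, so the fp32 weights alone are on the order of 36 GB, and the two fp32 Adam state tensors (momentum and variance) add roughly another 72 GB, so having every rank's partition in host memory at the same time can plausibly exhaust even 512 GB. Below is a minimal sketch of the staggered-loading idea described above, not DeepSpeed's actual implementation; it assumes torch.distributed is already initialized, that the launcher sets LOCAL_RANK and LOCAL_WORLD_SIZE, and that load_fn both reads a rank's shard and hands it off (e.g. to the GPU or optimizer) so its temporary CPU buffers are freed before the next rank's turn:

# Minimal sketch of staggered checkpoint loading, not DeepSpeed's actual
# implementation. Assumes torch.distributed is initialized and the launcher
# sets LOCAL_RANK / LOCAL_WORLD_SIZE (treat both as assumptions).
import os
import torch.distributed as dist

def staggered_checkpoint_load(load_fn):
    """Run load_fn() on one local rank at a time instead of on all ranks at once."""
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", 1))
    for turn in range(local_world_size):
        if local_rank == turn:
            # Only this process reads its checkpoint shard during its turn, so
            # peak host memory is roughly 1/local_world_size of concurrent loading,
            # provided load_fn releases its temporary CPU buffers when it returns.
            load_fn()
        dist.barrier()  # wait for the current rank before the next one starts

# Hypothetical usage with an engine-style loader (the name is illustrative only):
# staggered_checkpoint_load(
#     lambda: engine.load_checkpoint("/tmp/tst-summarization/checkpoint-10"))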