Running out of memory when resuming training.

This might be a similar problem to #11317: the node runs out of CPU memory (512 GB).
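
One way to confirm that it is host RAM rather than GPU memory being exhausted during the resume is to log the resident set size of the training process around the checkpoint load. This is a small diagnostic sketch, not part of the original repro; psutil and the call site are assumptions:

import os

import psutil


def log_host_mem(tag=""):
    # Print the resident set size (RSS) of the current process in GiB, so the
    # growth of CPU memory can be watched while the checkpoint is deserialized.
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"[host-mem]{tag} rss={rss_gib:.1f} GiB", flush=True)


# e.g. call once right before and once right after the resume step
log_host_mem(" before checkpoint load")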

To reproduce:

(i)

deepspeed --hostfile myhostfile \
    ${_PATH}/examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path hyunwoongko/blenderbot-9B \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --deepspeed ${_PATH}/tests/deepspeed/ds_config_zero3.json \
    --logging_steps 1 \
    --fp16 \
    --overwrite_output_dir \
    --save_steps 10 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy="steps" \
    --max_train_samples 10024 \
    --max_eval_samples 32 \
    --max_source_length 128 \
    --max_target_length 128 \
    --eval_steps 5

(ii) Afterwards, in order to resume, I use the option --resume_from_checkpoint /tmp/tst-summarization/checkpoint-10.

A workaround is to export the FP32 weights using the zero_to_fp32.py script, as described in https://huggingface.co/transformers/master/main_classes/deepspeed.html#getting-the-model-weights-out, and to restart directly from pytorch_model.bin. Nevertheless, it would be better to resume directly from the DeepSpeed checkpoint, if possible.
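
For concreteness, here is a minimal sketch of that workaround, assuming a DeepSpeed version that exposes the consolidation helper backing zero_to_fp32.py (otherwise, run the zero_to_fp32.py script that DeepSpeed drops into the checkpoint folder, as the linked docs describe):

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Consolidate the ZeRO-3 shards of the saved checkpoint into a single fp32
# state_dict on CPU, then write it out as pytorch_model.bin so a fresh run
# can start from plain weights instead of the DeepSpeed checkpoint.
checkpoint_dir = "/tmp/tst-summarization/checkpoint-10"
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
torch.save(state_dict, f"{checkpoint_dir}/pytorch_model.bin")

Note that optimizer and scheduler state are lost on this route, which is exactly why resuming from the DeepSpeed checkpoint itself would be preferable.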

torch: 1.8.1+cu111
transformers: 4.9.0.dev0
deepspeed: 0.4.4+d1a7a55

log: log.txt

@stas00

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

stas00 commented, Jul 14, 2021

Yes, that did the trick! It’s the same memory usage now. Applied here: https://github.com/huggingface/transformers/pull/12718

stas00 commented, Jul 15, 2021

The main issue is loading the optimizer states, which are 2x bigger than the fp32 model.

Actually, I thought of a possible solution last night. This is staggered checkpoint loading.

So if you have 4 GPUs on a node, currently the whole checkpoint folder gets loaded into CPU memory at once. However, what if we loaded one GPU at a time? That would require only 1/4 of the extra CPU memory, since once a GPU finishes loading it returns its CPU memory back to the pool.

I think this approach should solve your limitation. Let me try to implement this on the deepspeed side.
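
To make the staggered idea concrete, here is a minimal sketch (not DeepSpeed's actual implementation; load_rank_shard is a hypothetical callable standing in for whatever reads one rank's part of the checkpoint, and the torchrun-style environment variables are assumptions): local ranks on a node take turns deserializing, so only one rank's transient CPU buffers exist at any moment.

import os

import torch.distributed as dist


def staggered_load(load_rank_shard, checkpoint_dir):
    # With N local ranks, peak extra host memory is roughly 1/N of loading all
    # ranks at once, because each rank frees its temporary CPU buffers before
    # the next rank starts reading.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))
    state = None
    for turn in range(local_world_size):
        if turn == local_rank:
            # Only this rank touches the checkpoint during its turn.
            state = load_rank_shard(checkpoint_dir, local_rank)
        dist.barrier()  # wait until the current rank has finished and freed its buffers
    return state

The trade-off is that loading becomes serialized across the local ranks, so resume time grows roughly linearly with the number of GPUs per node.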

Top Results From Across the Web

  • Run_mlm.py cuda error memory after resuming a training: For some reason, running training from scratch with the Seq2SeqTrainer works just fine, but resuming from checkpoint exceeds the memory limit, ...
  • Out of memory error when resume training even though my ...: I am training a classification model and I have saved some checkpoints. When I try to resume training, however, I got out of...
  • Resuming pytorch model training raises error “CUDA out of ...: I was facing exactly the same issue, even my model was running out of memory on 80 GB A100, while starting from a...
  • Train a model — MMSegmentation 0.29.1 documentation: To trade speed with GPU memory, you may pass in --cfg-options ... Resume from a previous checkpoint file (to continue the training process)...
  • Stopping and Resuming a Tune Run - the Ray documentation: If you’ve stopped a run and want to resume from where you left off, ... Status == Memory usage on this node:...
