Running out of memory when resuming training
See original GitHub issue
Might be a similar problem to #11317: the node runs out of CPU memory (512 GB).
To reproduce:
(i)
deepspeed --hostfile myhostfile \
${_PATH}/examples/pytorch/summarization/run_summarization.py \
--model_name_or_path hyunwoongko/blenderbot-9B \
--do_train \
--do_eval \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--deepspeed ${_PATH}/tests/deepspeed/ds_config_zero3.json \
--logging_steps 1 \
--fp16 \
--overwrite_output_dir \
--save_steps 10 \
--gradient_accumulation_steps 1 \
--evaluation_strategy="steps" \
--max_train_samples 10024 \
--max_eval_samples 32 \
--max_source_length 128 \
--max_target_length 128 \
--eval_steps 5
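The reproduction points at the ZeRO-3 test config shipped with Transformers (tests/deepspeed/ds_config_zero3.json). As a hedged sketch only, and not the exact contents of that file, a minimal ZeRO stage-3 configuration in the same spirit looks roughly like the Python dict below; the "auto" values are placeholders that the Transformers DeepSpeed integration fills in from the TrainingArguments:

# Hedged sketch: NOT the exact contents of tests/deepspeed/ds_config_zero3.json,
# just a minimal ZeRO stage-3 configuration in the same spirit.
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}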
(ii)
Afterwards, in order to resume, I rerun the same command with the option --resume_from_checkpoint /tmp/tst-summarization/checkpoint-10.
A workaround is to export the FP32 weights with the zero_to_fp32.py script, as described in https://huggingface.co/transformers/master/main_classes/deepspeed.html#getting-the-model-weights-out, and restart directly from the resulting pytorch_model.bin. Nevertheless, it would be better to resume directly from the DeepSpeed checkpoint, if possible.
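As a rough illustration of that workaround, assuming a DeepSpeed version that exposes the zero_to_fp32 helpers as importable utilities (older versions only provide the standalone zero_to_fp32.py script inside the checkpoint folder), the consolidation step could look like this:

# Sketch only: consolidate the sharded ZeRO-3 checkpoint into a single fp32
# state dict and save it as a regular pytorch_model.bin to restart from.
# Assumes deepspeed.utils.zero_to_fp32 exposes this helper in your version.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "/tmp/tst-summarization/checkpoint-10"  # path from the repro above
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # runs on CPU
torch.save(state_dict, "/tmp/tst-summarization/pytorch_model.bin")

Note that the consolidation itself happens on CPU, so it still needs enough host memory for roughly one full fp32 copy of the model.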
torch: 1.8.1+cu111
transformers: 4.9.0.dev0
deepspeed: 0.4.4+d1a7a55
log: log.txt
Issue Analytics
- Created 2 years ago
- Comments: 9 (8 by maintainers)
Top Results From Across the Web
- Run_mlm.py cuda error memory after resuming a training: "For some reason, running training from scratch with the Seq2SeqTrainer works just fine, but resuming from checkpoint exceeds the memory limit, ..."
- Out of memory error when resume training even though my ...: "I am training a classification model and I have saved some checkpoints. When I try to resume training, however, I got out of..."
- Resuming pytorch model training raises error "CUDA out of ...": "I was facing exactly the same issue, even my model was running out of memory on 80 GB A100, while starting from a..."
- Train a model — MMSegmentation 0.29.1 documentation: "To trade speed with GPU memory, you may pass in --cfg-options ... Resume from a previous checkpoint file (to continue the training process)..."
- Stopping and Resuming a Tune Run - the Ray documentation: "If you've stopped a run and want to resume from where you left off, ... Status == Memory usage on this node: ..."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Yes, that did the trick! It’s the same memory usage now. Applied here: https://github.com/huggingface/transformers/pull/12718
The main issue is loading the optimizer states, which are 2x bigger than the fp32 model.
Actually, I thought of a possible solution last night: staggered checkpoint loading.
Right now, if you have 4 GPUs on a node, the whole checkpoint folder gets loaded into CPU memory at once. But what if we loaded one GPU at a time? That would require only 1/4 of the extra CPU memory, since as each GPU finishes loading it returns its CPU memory to the pool.
I think this approach should solve your limitation. Let me try to implement it on the DeepSpeed side.
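For a sense of scale: blenderbot-9B has roughly 9B parameters, so the fp32 weights alone are on the order of 36 GB, and the two fp32 Adam state tensors (momentum and variance) add roughly another 72 GB, so having every rank's partition in host memory at the same time can plausibly exhaust even 512 GB. Below is a minimal sketch of the staggered-loading idea described above, not DeepSpeed's actual implementation; it assumes torch.distributed is already initialized, that the launcher sets LOCAL_RANK and LOCAL_WORLD_SIZE, and that load_fn both reads a rank's shard and hands it off (e.g. to the GPU or optimizer) so its temporary CPU buffers are freed before the next rank's turn:

# Minimal sketch of staggered checkpoint loading, not DeepSpeed's actual
# implementation. Assumes torch.distributed is initialized and the launcher
# sets LOCAL_RANK / LOCAL_WORLD_SIZE (treat both as assumptions).
import os
import torch.distributed as dist

def staggered_checkpoint_load(load_fn):
    """Run load_fn() on one local rank at a time instead of on all ranks at once."""
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", 1))
    for turn in range(local_world_size):
        if local_rank == turn:
            # Only this process reads its checkpoint shard during its turn, so
            # peak host memory is roughly 1/local_world_size of concurrent loading,
            # provided load_fn releases its temporary CPU buffers when it returns.
            load_fn()
        dist.barrier()  # wait for the current rank before the next one starts

# Hypothetical usage with an engine-style loader (the name is illustrative only):
# staggered_checkpoint_load(
#     lambda: engine.load_checkpoint("/tmp/tst-summarization/checkpoint-10"))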