OOM during saving step
I’m trying to train the Blenderbot-9B model using the DeepSpeed integration on 8 GPUs, each with 16GB of VRAM (one node).
Script:
deepspeed --hostfile myhostfile \
  ${_PATH}/examples/pytorch/summarization/run_summarization.py \
  --model_name_or_path hyunwoongko/blenderbot-9B \
  --do_train \
  --do_eval \
  --dataset_name cnn_dailymail \
  --dataset_config "3.0.0" \
  --source_prefix "summarize: " \
  --output_dir /tmp/tst-summarization \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --deepspeed ${_PATH}/tests/deepspeed/ds_config_zero3.json \
  --logging_steps 1 \
  --fp16 \
  --overwrite_output_dir \
  --save_steps 10 \
  --gradient_accumulation_steps 1 \
  --evaluation_strategy="steps" \
  --max_train_samples 10024 \
  --max_eval_samples 32 \
  --max_source_length 128 --max_target_length 128 \
  --eval_steps 5
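For context, a minimal sketch of the ZeRO-3 options that matter most for this issue is below. This is an assumption about the shape of the config, not the actual contents of `tests/deepspeed/ds_config_zero3.json` shipped with Transformers, which has more fields; the field names are the standard DeepSpeed ZeRO-3 options, in particular `stage3_gather_fp16_weights_on_model_save`, which controls whether the full fp16 weights are consolidated when a checkpoint is saved.

```python
import json

# Minimal sketch (assumption): only the ZeRO-3 options most relevant to saving.
# The actual tests/deepspeed/ds_config_zero3.json may differ in details.
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        # When true, the full fp16 weights are consolidated at save time.
        "stage3_gather_fp16_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```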
Training and evaluation seem to run fine; I see about 10GB of VRAM occupied on each GPU, so there is even free space left on the GPUs. However, afterwards, during the saving step, I get an OOM, which I don’t understand.
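One plausible explanation for the gap between the training footprint and the save-time OOM (my own back-of-envelope reasoning, not something confirmed by the log): under ZeRO-3 each GPU holds only a shard of the parameters during training, but if the full fp16 weights are gathered onto a single rank when the checkpoint is written, that consolidated copy alone is close to the capacity of a 16GB card. Rough arithmetic, assuming roughly 8.7B parameters for the 9B checkpoint:

```python
# Rough arithmetic (assumption: ~8.7e9 parameters, fp16 = 2 bytes per parameter).
num_params = 8.7e9
bytes_per_param_fp16 = 2

sharded_gib = num_params * bytes_per_param_fp16 / 8 / 2**30   # per-GPU shard across 8 GPUs during ZeRO-3 training
gathered_gib = num_params * bytes_per_param_fp16 / 2**30      # full fp16 weights on one rank if gathered at save time

print(f"per-GPU parameter shard: ~{sharded_gib:.1f} GiB")
print(f"fully gathered fp16 weights: ~{gathered_gib:.1f} GiB")
# ~2.0 GiB sharded vs ~16.2 GiB gathered: the gathered copy alone nearly fills a 16GB card.
```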
Log: log.txt
DeepSpeed: 0.4.3+c9fee82, torch 1.8, CUDA 11.1
Transformers: 4.9.0.dev0
Top GitHub Comments
I think this is more on the DeepSpeed side, so cc-ing @stas00 to confirm.
This version should do the right thing as all the tests now pass: https://github.com/microsoft/DeepSpeed/pull/1223
Unfortunately it missed the latest DeepSpeed release, so it will go into the next one.
Do let me know if you encounter any issues with this PR branch.
Thank you.
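As a possible workaround on an unpatched DeepSpeed (an assumption on my part, not something suggested in the thread above): disable fp16 weight gathering at save time (`stage3_gather_fp16_weights_on_model_save: false`), let the Trainer write only the sharded ZeRO checkpoint, and reconstruct a consolidated fp32 state dict offline on CPU with DeepSpeed's `zero_to_fp32` utilities, assuming they are available in the installed version:

```python
# Hypothetical offline reconstruction of a full fp32 state dict from a ZeRO-3
# checkpoint directory (assumes the installed DeepSpeed ships the zero_to_fp32 utilities).
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "/tmp/tst-summarization/checkpoint-10"  # hypothetical path from --output_dir / --save_steps
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # built in CPU RAM, not on GPU
torch.save(state_dict, "/tmp/tst-summarization/pytorch_model_fp32.bin")
```

Because the reconstruction happens in host memory rather than on a GPU, it sidesteps the 16GB VRAM limit, but expect on the order of 35GB of CPU RAM for the fp32 copy of a ~9B-parameter model.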