
OOM during saving step

See original GitHub issue

I’m trying to train the Blenderbot-9B model using the DeepSpeed integration on 8 GPUs (one node), each with 16 GB of VRAM.

Script:

deepspeed --hostfile myhostfile \
    ${_PATH}/examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path hyunwoongko/blenderbot-9B \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --deepspeed ${_PATH}/tests/deepspeed/ds_config_zero3.json \
    --logging_steps 1 \
    --fp16 \
    --overwrite_output_dir \
    --save_steps 10 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy="steps" \
    --max_train_samples 10024 \
    --max_eval_samples 32 \
    --max_source_length 128 --max_target_length 128 \
    --eval_steps 5
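For context, the ds_config_zero3.json referenced above is the stock config from the Transformers test suite. From memory it looked roughly like the following around that release (trimmed to the relevant parts; the "auto" values are filled in by the HF Trainer, and the exact keys in your copy may differ):

{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_fp16_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}

The last ZeRO flag is the one involved in the saving step: when enabled, DeepSpeed consolidates the full fp16 weights from the ZeRO-3 shards at checkpoint time.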

Training and evaluation seem to run fine; I see about 10 GB of VRAM occupied on each GPU, so there is even free space left. However, afterwards during the saving step I get an OOM error, which I don’t understand.
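One plausible reading (a back-of-the-envelope estimate, not stated in the thread): a 9B-parameter model in fp16 needs

9 × 10^9 params × 2 bytes/param ≈ 18 GB

just for the consolidated weights, which is more than the 16 GB available on any single GPU. So a save path that materializes the full fp16 state dict on one device can hit OOM even though sharded training fits comfortably.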

Log: log.txt

DeepSpeed: 0.4.3+c9fee82, torch 1.8, CUDA 11.1

Transformers: 4.9.0.dev0

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

2 reactions
sgugger commented, Jul 12, 2021

I think this is more on the DeepSpeed side so cc-ing @stas00 to confirm.

1 reaction
stas00 commented, Jul 13, 2021

This version should do the right thing as all the tests now pass: https://github.com/microsoft/DeepSpeed/pull/1223

Unfortunately it missed the latest DeepSpeed release, so it will go into the next one.

Do let me know if you encounter any issues with this PR branch.

Thank you.
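For anyone who wants to test that PR before it lands in a release, one standard way to install a GitHub PR branch from source (generic git/pip steps, not commands given in the thread) is:

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
git fetch origin pull/1223/head:pr-1223
git checkout pr-1223
pip install .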
