
OOM during saving step

See original GitHub issue

I’m trying to train the Blenderbot-9B model using the DeepSpeed integration on 8 GPUs (one node), each with 16 GB of VRAM.

Script:

deepspeed --hostfile myhostfile \
    ${_PATH}/examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path hyunwoongko/blenderbot-9B \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --deepspeed ${_PATH}/tests/deepspeed/ds_config_zero3.json \
    --logging_steps 1 \
    --fp16 \
    --overwrite_output_dir \
    --save_steps 10 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy="steps" \
    --max_train_samples 10024 \
    --max_eval_samples 32 \
    --max_source_length 128 --max_target_length 128 \
    --eval_steps 5
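For context, the ds_config_zero3.json referenced above is the stock config from the Transformers test suite. From memory it looked roughly like the following around that release (trimmed to the relevant parts; the "auto" values are filled in by the HF Trainer, and the exact keys in your copy may differ):

{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_fp16_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}

The last ZeRO flag is the one involved in the saving step: when enabled, DeepSpeed consolidates the full fp16 weights from the ZeRO-3 shards at checkpoint time.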

Training and evaluation seem to run fine; I see about 10 GB of VRAM occupied on each GPU, so there is even free space left. However, afterwards during the saving step I get an OOM error, which I don’t understand.
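One plausible reading (a back-of-the-envelope estimate, not stated in the thread): a 9B-parameter model in fp16 needs

9 × 10^9 params × 2 bytes/param ≈ 18 GB

just for the consolidated weights, which is more than the 16 GB available on any single GPU. So a save path that materializes the full fp16 state dict on one device can hit OOM even though sharded training fits comfortably.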

Log: log.txt

DeepSpeed: 0.4.3+c9fee82, torch 1.8, CUDA 11.1

Transformers: 4.9.0.dev0

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

2 reactions
sgugger commented, Jul 12, 2021

I think this is more on the DeepSpeed side so cc-ing @stas00 to confirm.

1 reaction
stas00 commented, Jul 13, 2021

This version should do the right thing as all the tests now pass: https://github.com/microsoft/DeepSpeed/pull/1223

Unfortunately it missed the latest DeepSpeed release, so it will go into the next one.

Do let me know if you encounter any issues with this PR branch.

Thank you.
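For anyone who wants to test that PR before it lands in a release, one standard way to install a GitHub PR branch from source (generic git/pip steps, not commands given in the thread) is:

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
git fetch origin pull/1223/head:pr-1223
git checkout pr-1223
pip install .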
