[BUG] leaner CPU memory allocations with cpu offload
Describe the bug
As models get huge, offloading is becoming more and more important, especially for inference. This issue discusses inference, which is the simpler case: blocks can be moved one at a time and there are no optimizer states to deal with.
I think DeepSpeed CPU offload consumes more CPU memory than it should. I use /usr/bin/time
for reliable max CPU memory usage measurements. Here is how I invoke the program and how I get the reports:
/usr/bin/time -v deepspeed --num_gpus 1 test_ds_inference.py
[....]
Maximum resident set size (kbytes): 10754388
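(For reference, the same peak figure can also be read from inside the process; a minimal sketch, assuming Linux, where ru_maxrss is reported in kilobytes:)

```python
import resource

# Peak resident set size of the current process so far.
# On Linux ru_maxrss is in kilobytes; on macOS it is in bytes.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS so far: {peak_kb / 2**20:.2f} GiB")
```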
Here is the script - it was adapted to use "sgugger/sharded-gpt-j-6B", so that each shard is no more than 7GB, which keeps CPU memory allocations tight, i.e. we don't load the full state_dict
into memory at once.
The full script follows - it's very simple: mainly comments and the ds config.
#!/usr/bin/env python
# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, AutoModel
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os
import torch
os.environ["TOKENIZERS_PARALLELISM"] = "false" # To avoid warnings about parallelism in tokenizers
# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()
model_name = "sgugger/sharded-gpt-j-6B"
#model_name = "t5-large"
#model_name = "t5-3b"
config = AutoConfig.from_pretrained(model_name)
model_hidden_size = config.hidden_size
# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size
# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you don't
#   want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
#   which params should remain on gpus - the larger the value the smaller the offload size
#
# For indepth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed
# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
ds_config = {
    "fp16": {
        "enabled": True
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "none",
            "pin_memory": True,
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on
# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model with AutoModelForCausalLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
# now a model can be loaded.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval() # inference
# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Hello, my name is"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n in={text_in}\n out={text_out}")
So here are some breakdowns of CPU memory usage at each stage:

w/ cpu offload:
- 0.6 GB before from_pretrained
- 22 GB after from_pretrained
- 32 GB after deepspeed.initialize

w/o cpu offload:
- 0.6 GB before from_pretrained
- 10 GB after from_pretrained
- 15 GB after deepspeed.initialize
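(A minimal sketch of how such per-stage RSS snapshots can be collected, assuming psutil is installed; report_rss is a hypothetical helper, not necessarily how the numbers above were produced:)

```python
import os
import psutil

def report_rss(tag):
    # Current resident set size of this process, in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 2**30
    print(f"[{tag}] CPU RSS: {rss_gb:.1f} GB")

report_rss("before from_pretrained")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
report_rss("after from_pretrained")
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
report_rss("after deepspeed.initialize")
```

DeepSpeed also ships deepspeed.runtime.utils.see_memory_usage, which prints similar CPU/GPU memory stats.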
Here are the questions:
- Unaccounted-for memory during offload:
With offload the peak CPU memory was 32GB; without offload it was 15GB, so offloading added 17GB. The fp16 6B model should need only about 12GB of CPU memory for the offloaded params (6B params x 2 bytes), which leaves about 5GB I can't account for (32GB - 15GB = 17GB; 17GB - 12GB = 5GB).
- The other question is why, when the model is allocated via zero.Init w/o offload, it consumes 10GB of CPU memory and not something close to 1GB. I bracketed the code in question, which is really just the good old:
with deepspeed.zero.Init(config_dict_or_path=deepspeed_config()):
    model = cls(config, **kwargs)
Why did the CPU memory grow from 0.6GB to 10GB if the model ends up on the GPU? Should it not release all the temporary memory there?
Again, before the zero.Init call the program's footprint was just 0.6GB - no checkpoints had been loaded yet.
For this 2nd question I wonder whether it's just confusing memory management and the extra memory is merely cached.
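To probe that hypothesis, here is a minimal sketch (my own instrumentation, assuming psutil; it mirrors the script above rather than the internal transformers code path) that brackets the zero.Init-backed from_pretrained call with RSS snapshots and a collection pass:

```python
import gc
import os
import psutil

def rss_gb():
    return psutil.Process(os.getpid()).memory_info().rss / 2**30

# from_pretrained runs under deepspeed.zero.Init because the HfDeepSpeedConfig
# object created earlier is still alive.
print(f"before from_pretrained: {rss_gb():.1f} GB")  # ~0.6 GB reported above
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
print(f"after from_pretrained:  {rss_gb():.1f} GB")  # ~10 GB reported above (w/o offload)

# Rough check: if the growth were only unreferenced temporaries, a collection
# pass should let RSS shrink (allocator caching may still keep some resident).
gc.collect()
print(f"after gc.collect():     {rss_gb():.1f} GB")
```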
also cc: @sgugger who initially reported this issue
The env versions are unimportant - I tried various versions and the behavior is pretty consistent, so let's assume ds@master for the sake of this ticket.
And thank you!
Top GitHub Comments
This is due to unnecessary zero stage 3 memory allocation that is exposed because of the shared code base. To address this, I have embarked on the trivial task of pulling the entire offloading logic out of stage 3 code. Wish me luck 😃. Anyways, the preliminary results are as follows
master vs PR #2009: (preliminary results not reproduced here)
Closing, addressed in https://github.com/microsoft/DeepSpeed/pull/2009