[BUG] leaner CPU memory allocations with cpu offload
Describe the bug
As models get huge, offloading is becoming more and more important, especially for inference. This issue discusses inference, which is the simpler case: blocks can be moved one at a time and there are no optimizer states to deal with.
I think DeepSpeed CPU offload consumes more CPU memory than it should. I use /usr/bin/time
for reliable max CPU memory usage measurements. Here is how I invoke the program and how I get the reports:
/usr/bin/time -v deepspeed --num_gpus 1 test_ds_inference.py
[....]
Maximum resident set size (kbytes): 10754388
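(For reference, the same peak figure can also be read from inside the process; a minimal sketch, assuming Linux, where ru_maxrss is reported in kilobytes:)

```python
import resource

# Peak resident set size of the current process so far.
# On Linux ru_maxrss is in kilobytes; on macOS it is in bytes.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS so far: {peak_kb / 2**20:.2f} GiB")
```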
Here is the script - it was adapted to use "sgugger/sharded-gpt-j-6B", so that each shard is no more than 7GB, which keeps CPU memory allocations tight, i.e. we don't load the full state_dict
into memory at once.
The full script follows - it's very simple: mainly comments and the ds config.
#!/usr/bin/env python
# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, AutoModel
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os
import torch
os.environ["TOKENIZERS_PARALLELISM"] = "false" # To avoid warnings about parallelism in tokenizers
# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()
model_name = "sgugger/sharded-gpt-j-6B"
#model_name = "t5-large"
#model_name = "t5-3b"
config = AutoConfig.from_pretrained(model_name)
model_hidden_size = config.hidden_size
# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size
# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you don't
#   want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
#   which params should remain on gpus - the larger the value the smaller the offload size
#
# For indepth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed
# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
ds_config = {
    "fp16": {
        "enabled": True
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "none",
            "pin_memory": True,
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on
# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model with AutoModelForCausalLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
# now a model can be loaded.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval() # inference
# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Hello, my name is"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n in={text_in}\n out={text_out}")
So here are some breakdowns of CPU memory usage at each stage:

w/ cpu offload:
- 0.6 GB before from_pretrained
- 22 GB after from_pretrained
- 32 GB after deepspeed.initialize

w/o cpu offload:
- 0.6 GB before from_pretrained
- 10 GB after from_pretrained
- 15 GB after deepspeed.initialize
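(A minimal sketch of how such per-stage RSS snapshots can be collected, assuming psutil is installed; report_rss is a hypothetical helper, not necessarily how the numbers above were produced:)

```python
import os
import psutil

def report_rss(tag):
    # Current resident set size of this process, in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 2**30
    print(f"[{tag}] CPU RSS: {rss_gb:.1f} GB")

report_rss("before from_pretrained")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
report_rss("after from_pretrained")
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
report_rss("after deepspeed.initialize")
```

DeepSpeed also ships deepspeed.runtime.utils.see_memory_usage, which prints similar CPU/GPU memory stats.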
Here are the questions:
- Unaccounted-for memory during offload:
With offload the peak CPU memory was 32GB; without offload it was 15GB, so offloading added 17GB. The fp16 6B model should need only about 12GB of CPU memory for the offloaded params (6B params x 2 bytes), which leaves about 5GB I can't account for (32GB - 15GB = 17GB; 17GB - 12GB = 5GB).
- The other question is why, when the model is allocated via zero.Init w/o offload, it consumes 10GB of CPU memory and not something close to 1GB. I bracketed the code in question, which is really just the good old:
with deepspeed.zero.Init(config_dict_or_path=deepspeed_config()):
    model = cls(config, **kwargs)
Why did the CPU memory grow from 0.6GB to 10GB if the model ends up on the GPU? Should it not release all the temporary memory there?
Again, before the zero.Init call the program's footprint was just 0.6GB - no checkpoints had been loaded yet.
For this 2nd question I wonder whether it's just confusing memory management and the extra memory is merely cached.
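To probe that hypothesis, here is a minimal sketch (my own instrumentation, assuming psutil; it mirrors the script above rather than the internal transformers code path) that brackets the zero.Init-backed from_pretrained call with RSS snapshots and a collection pass:

```python
import gc
import os
import psutil

def rss_gb():
    return psutil.Process(os.getpid()).memory_info().rss / 2**30

# from_pretrained runs under deepspeed.zero.Init because the HfDeepSpeedConfig
# object created earlier is still alive.
print(f"before from_pretrained: {rss_gb():.1f} GB")  # ~0.6 GB reported above
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
print(f"after from_pretrained:  {rss_gb():.1f} GB")  # ~10 GB reported above (w/o offload)

# Rough check: if the growth were only unreferenced temporaries, a collection
# pass should let RSS shrink (allocator caching may still keep some resident).
gc.collect()
print(f"after gc.collect():     {rss_gb():.1f} GB")
```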
also cc: @sgugger who initially reported this issue
The env versions are unimportant - I tried various versions and the behavior is pretty consistent, so let's assume ds@master for the sake of this ticket.
And thank you!
Top GitHub Comments
This is due to unnecessary zero stage 3 memory allocation that is exposed because of the shared code base. To address this, I have embarked on the trivial task of pulling the entire offloading logic out of stage 3 code. Wish me luck 😃. Anyways, the preliminary results are as follows
master vs PR #2009: (preliminary results not reproduced here)
Closing, addressed in https://github.com/microsoft/DeepSpeed/pull/2009