[BUG] cpu_offload error
Describe the bug
Hello, I get an error when I set cpu_offload=true. Training works when I set cpu_offload=false. I traced the error to cpu_adam.py and found that the size of self.param_groups does not match the size of fp16_param_groups. Looking forward to your reply. Here is my modified cpu_adam.py with the added print code:
'''
Copyright 2020 The Microsoft DeepSpeed Team
'''

import math
import torch
import time
from pathlib import Path
from ..op_builder import CPUAdamBuilder
from deepspeed.utils.logging import should_log_le


class DeepSpeedCPUAdam(torch.optim.Optimizer):
    optimizer_id = 0

    def __init__(self,
                 model_params,
                 lr=1e-3,
                 bias_correction=True,
                 betas=(0.9,
                        0.999),
                 eps=1e-8,
                 weight_decay=0,
                 amsgrad=False,
                 adamw_mode=True,
                 fp32_optimizer_states=True):
        """Fast vectorized implementation of two variations of Adam optimizer on CPU:

        * Adam: A Method for Stochastic Optimization: (https://arxiv.org/abs/1412.6980);
        * AdamW: Fixing Weight Decay Regularization in Adam (https://arxiv.org/abs/1711.05101)

        DeepSpeed CPU Adam(W) provides between 5x to 7x speedup over torch.optim.adam(W).
        In order to apply this optimizer, the model requires to have its master parameter (in FP32)
        reside on the CPU memory.

        To train on a heterogeneous system, such as coordinating CPU and GPU, DeepSpeed offers
        the ZeRO-Offload technology which efficiently offloads the optimizer states into CPU memory,
        with minimal impact on training throughput. DeepSpeedCPUAdam plays an important role to minimize
        the overhead of the optimizer's latency on CPU. Please refer to ZeRO-Offload tutorial
        (https://www.deepspeed.ai/tutorials/zero-offload/) for more information on how to enable this technology.

        For calling step function, there are two options available: (1) update optimizer's states and (2) update
        optimizer's states and copy the parameters back to GPU at the same time. We have seen that the second
        option can bring 30% higher throughput than the doing the copy separately using option one.

        .. note::
                We recommend using our `config
                <https://www.deepspeed.ai/docs/config-json/#optimizer-parameters>`_
                to allow :meth:`deepspeed.initialize` to build this optimizer
                for you.

        Arguments:
            model_params (iterable): iterable of parameters to optimize or dicts defining
                parameter groups.
            lr (float, optional): learning rate. (default: 1e-3)
            betas (Tuple[float, float], optional): coefficients used for computing
                running averages of gradient and its square. (default: (0.9, 0.999))
            eps (float, optional): term added to the denominator to improve
                numerical stability. (default: 1e-8)
            weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
            amsgrad (boolean, optional): whether to use the AMSGrad variant of this
                algorithm from the paper `On the Convergence of Adam and Beyond`_
                (default: False) NOT SUPPORTED in DeepSpeed CPUAdam!
            adamw_mode: select between Adam and AdamW implementations (default: AdamW)
            full_precision_optimizer_states: creates momementum and variance in full precision regardless of
                the precision of the parameters (default: True)
        """
        default_args = dict(lr=lr,
                            betas=betas,
                            eps=eps,
                            weight_decay=weight_decay,
                            bias_correction=bias_correction,
                            amsgrad=amsgrad)
        super(DeepSpeedCPUAdam, self).__init__(model_params, default_args)

        self.opt_id = DeepSpeedCPUAdam.optimizer_id
        DeepSpeedCPUAdam.optimizer_id = DeepSpeedCPUAdam.optimizer_id + 1
        self.adam_w_mode = adamw_mode
        self.fp32_optimizer_states = fp32_optimizer_states
        self.ds_opt_adam = CPUAdamBuilder().load()

        self.ds_opt_adam.create_adam(self.opt_id,
                                     lr,
                                     betas[0],
                                     betas[1],
                                     eps,
                                     weight_decay,
                                     adamw_mode,
                                     should_log_le("info"))

    def __del__(self):
        # need to destroy the C++ object explicitly to avoid a memory leak when deepspeed.initialize
        # is used multiple times in the same process (notebook or pytest worker)
        self.ds_opt_adam.destroy_adam(self.opt_id)

    def __setstate__(self, state):
        super(DeepSpeedCPUAdam, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('amsgrad', False)
    @torch.no_grad()
    def step(self, closure=None, fp16_param_groups=None):
        """Update the model parameters.

        .. note::
            This method will be called internally by ZeRO-Offload. DeepSpeed
            users should still use ``engine.step()`` as shown in the
            `Getting Started
            <https://www.deepspeed.ai/getting-started/#training>`_ guide.

        Args:
            closure (callable, optional): closure to compute the loss.
                Defaults to ``None``.
            fp16_param_groups: FP16 GPU parameters to update. Performing the
                copy here reduces communication time. Defaults to ``None``.

        Returns:
            loss: if ``closure`` is provided. Otherwise ``None``.
        """

        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        # intended device for step
        device = torch.device('cpu')

        # converting the fp16 params to a group of parameter
        if type(fp16_param_groups) is list:
            if type(fp16_param_groups[0]) is not list:
                fp16_param_groups = [fp16_param_groups]
        elif fp16_param_groups is not None:
            fp16_param_groups = [[fp16_param_groups]]

        for group_id, group in enumerate(self.param_groups):
            for param_id, p in enumerate(group['params']):

                if p.grad is None:
                    continue

                assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make " \
                        "sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config."

                state = self.state[p]
                # State initialization
                if len(state) == 0:
                    #print(f'group {group_id} param {param_id} = {p.numel()}')
                    state['step'] = 0

                    #use full precision by default unless self.fp32_optimizer_states is off
                    state_dtype = torch.float if self.fp32_optimizer_states else p.dtype

                    # gradient momentums
                    state['exp_avg'] = torch.zeros_like(p.data,
                                                        dtype=state_dtype,
                                                        device=device)
                    #memory_format=torch.preserve_format)
                    # gradient variances
                    state['exp_avg_sq'] = torch.zeros_like(p.data,
                                                           dtype=state_dtype,
                                                           device=device)
                    #memory_format=torch.preserve_format)

                state['step'] += 1
                beta1, beta2 = group['betas']

                def print_my(group_name, fp16_param_groups):
                    format_string = ' ('
                    for i, fgroup in enumerate(fp16_param_groups):
                        format_string += '\n'
                        format_string += 'Parameter Group {0}\n'.format(i)
                        for key in fgroup:
                            format_string += '{}: ,shape:{}\n'.format(key, key.shape)
                    format_string += ')'
                    print('{}:\n{}'.format(group_name,format_string),flush=True)

                if fp16_param_groups is not None:
                    print('group_id:{}, param_id:{}'.format(group_id, param_id), flush=True)
                    print('param_group size:', len(self.param_groups),len(group['params']))
                    print('fp16_param_group size:', len(fp16_param_groups),len(fp16_param_groups[0]))
                    format_string = ' (\n'
                    format_string += '{}: ,shape:{}\n'.format(p, p.shape)
                    format_string += ')'
                    print('{}:\n{}\n'.format("param_group",format_string),flush=True)
                    # print_my("fp16_param_groups", fp16_param_groups)
                    format_string = ' (\n'
                    format_string += '{}: ,shape:{}\n'.format(fp16_param_groups[group_id][param_id], fp16_param_groups[group_id][param_id].shape)
                    format_string += ')'
                    print('{}:\n{}\n'.format("fp16_param_group",format_string),flush=True)
                    self.ds_opt_adam.adam_update_copy(
                        self.opt_id,
                        state['step'],
                        group['lr'],
                        beta1,
                        beta2,
                        group['eps'],
                        group['weight_decay'],
                        group['bias_correction'],
                        p.data,
                        p.grad.data,
                        state['exp_avg'],
                        state['exp_avg_sq'],
                        fp16_param_groups[group_id][param_id].data)
                else:
                    self.ds_opt_adam.adam_update(self.opt_id,
                                                 state['step'],
                                                 group['lr'],
                                                 beta1,
                                                 beta2,
                                                 group['eps'],
                                                 group['weight_decay'],
                                                 group['bias_correction'],
                                                 p.data,
                                                 p.grad.data,
                                                 state['exp_avg'],
                                                 state['exp_avg_sq'])
        return loss
here is the print:
here is the error information:
 14%|█▍ | 71/508 [01:30<09:19, 1.28s/it]
Traceback (most recent call last):
  File "/home/yuanzheng/biobart/train.py", line 502, in <module>
    main()
  File "/home/yuanzheng/biobart/train.py", line 495, in main
    run(args, model, optimizer, start_epoch)
  File "/home/yuanzheng/biobart/train.py", line 446, in run
    train(args, index, model, optimizer, pretrain_dataset_provider)
  File "/home/yuanzheng/biobart/train.py", line 379, in train
    model.network.step()
  File "/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1893, in step
    self._take_model_step(lr_kwargs)
  File "/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1799, in _take_model_step
    self.optimizer.step()
  File "/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1720, in step
    self.optimizer.step(fp16_param_groups=bit16_param_groups)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 191, in step
    format_string += '{}: ,shape:{}\n'.format(fp16_param_groups[group_id][param_id], fp16_param_groups[group_id][param_id].shape)
IndexError: list index out of range
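The IndexError is raised when fp16_param_groups[group_id][param_id] is indexed with positions taken from self.param_groups, so the two structures must have the same number of groups and the same number of parameters per group. A minimal sketch of a check that reports where they differ is shown here. The helper names normalize_groups and find_group_mismatches are hypothetical and not part of DeepSpeed, and plain strings stand in for tensors so the sketch runs without torch. The normalization mirrors the list-of-lists conversion in step(), using isinstance instead of type(...) is list.

```python
def normalize_groups(fp16_param_groups):
    """Mirror the conversion in step(): return a list of lists, or None."""
    if isinstance(fp16_param_groups, list):
        if not isinstance(fp16_param_groups[0], list):
            return [fp16_param_groups]
        return fp16_param_groups
    if fp16_param_groups is not None:
        return [[fp16_param_groups]]
    return None


def find_group_mismatches(param_groups, fp16_param_groups):
    """Describe every size mismatch between optimizer groups and fp16 groups.

    param_groups follows the torch.optim layout: a list of dicts,
    each holding a 'params' list.
    """
    fp16_groups = normalize_groups(fp16_param_groups)
    if fp16_groups is None:
        return []
    problems = []
    if len(param_groups) != len(fp16_groups):
        problems.append(
            f"group count: {len(param_groups)} optimizer vs {len(fp16_groups)} fp16")
    for group_id, group in enumerate(param_groups):
        if group_id >= len(fp16_groups):
            break  # already reported by the group-count check
        n_opt = len(group["params"])
        n_fp16 = len(fp16_groups[group_id])
        if n_opt != n_fp16:
            problems.append(
                f"group {group_id}: {n_opt} optimizer params vs {n_fp16} fp16 params")
    return problems


# Stand-in objects replace tensors (hypothetical data, not from the issue).
optimizer_groups = [{"params": ["w0", "w1"]}, {"params": ["b0"]}]
fp16_groups = [["w0_fp16"], ["b0_fp16"]]
for problem in find_group_mismatches(optimizer_groups, fp16_groups):
    print(problem)
```

Calling such a check at the top of step() with self.param_groups and fp16_param_groups would report the mismatch once, rather than failing partway through the parameter loop.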
To Reproduce
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 32,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-6,
      "weight_decay": 0.01
    }
  },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": 4,
  "gradient_clipping": 1,
  "steps_per_print": 2000,
  "train_micro_batch_size_per_gpu": 40,
  "wall_clock_breakdown": false
}
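The assertion message in cpu_adam.py refers to an 'offload_optimizer' setting, while this config uses the boolean cpu_offload flag. To my knowledge, the DeepSpeed configuration docs describe cpu_offload as deprecated in favor of an offload_optimizer block with a device key; whether the two behave identically in 0.5.10 is an assumption worth verifying. The sketch below only shows how the posted zero_optimization section would look in that form. It is not claimed to fix the IndexError, and the helper name to_offload_optimizer is hypothetical.

```python
import json

# zero_optimization section as posted in the issue.
posted = {
    "stage": 2,
    "cpu_offload": True,
    "allgather_partitions": True,
    "allgather_bucket_size": 5e8,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": True,
}


def to_offload_optimizer(zero_cfg):
    """Return a copy that swaps the cpu_offload flag for an offload_optimizer block."""
    cfg = dict(zero_cfg)  # shallow copy; the input is left untouched
    if cfg.pop("cpu_offload", False):
        # Assumed equivalent form, per the DeepSpeed config documentation.
        cfg["offload_optimizer"] = {"device": "cpu"}
    return cfg


converted = to_offload_optimizer(posted)
print(json.dumps({"zero_optimization": converted}, indent=2))
```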
Expected behavior
Training should run without errors when cpu_offload=true.
ds_report output
/home/yuanzheng/anaconda3/lib/python3.8/runpy.py:127: RuntimeWarning: 'deepspeed.env_report' found in sys.modules after import of package 'deepspeed', but prior to execution of 'deepspeed.env_report'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/yuanzheng/anaconda3/lib/python3.8/site-packages/torch']
torch version .................... 1.9.1+cu111
torch cuda version ............... 11.1
nvcc version ..................... [FAIL] cannot find CUDA_HOME via torch.utils.cpp_extension.CUDA_HOME=None
deepspeed install path ........... ['/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.10, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3
System info:
- Python version: 3.8
Launcher context
I use a Slurm environment with the torch launcher.
#!/bin/bash
#SBATCH --job-name=bart-dist
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --gres=gpu:8 # number of gpus
#SBATCH -o ../logs/%x-%j.log
#SBATCH -e ../logs/%x-%j.err
set -x -e
echo "START TIME: $(date)"
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=12006
GPUS_PER_NODE=8
NNODES=1
export TORCH_EXTENSIONS_DIR="/platform_tech/yuanzheng/tmp/torch_extensions"
export LAUNCHER="python -u -m torch.distributed.launch \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
--max_restarts 0 \
"
export CMD="
/home/yuanzheng/biobart/train.py --config-file ./bart_queue.json \
--output_dir /platform_tech/yuanzheng/biobart_ckps/biobart_test \
--token_nosing_prob 0.1 \
--max_seq_length 512 \
--max_predictions_per_seq 150 \
--seed 42 \
--lr_schedule LL \
--job_name biobart_pretrain \
--print_steps 10 \
--save_steps 1000 \
--data_path_prefix /platform_tech/yuanzheng/pretrain_data/ \
--deepspeed --deepspeed_config ./ds_config_zero2_queue.json
"
SINGULARITY_PATH=/platform_tech/yuanzheng/pytorch21_06_py3_docker_image_v2.sif
srun --jobid $SLURM_JOBID singularity exec --nv -B /platform_tech/yuanzheng/:/platform_tech/yuanzheng/,/home/yuanzheng/:/home/yuanzheng/ $SINGULARITY_PATH bash -c '$LAUNCHER --node_rank $SLURM_PROCID $CMD'
Issue Analytics
- State:
- Created 2 years ago
- Reactions:1
- Comments:5 (2 by maintainers)
Hi @RezaYazdaniAminabadi, it works for me, thanks a lot!
Hi @ganzhiruyi and @antoiloui,
I just sent a PR to fix this. Can you please try it on your side? Thanks, Reza