
[BUG] cpu_offload error


Describe the bug

Hello, I hit an error when I turn on cpu_offload=true; everything is fine with cpu_offload=false. I traced the error into cpu_adam.py and found that the size of self.param_groups does not align with fp16_param_groups. Looking forward to your reply. Here is my modified cpu_adam.py with the print statements I added:

'''
Copyright 2020 The Microsoft DeepSpeed Team
'''

import math
import torch
import time
from pathlib import Path
from ..op_builder import CPUAdamBuilder
from deepspeed.utils.logging import should_log_le


class DeepSpeedCPUAdam(torch.optim.Optimizer):
    optimizer_id = 0

    def __init__(self,
                 model_params,
                 lr=1e-3,
                 bias_correction=True,
                 betas=(0.9,
                        0.999),
                 eps=1e-8,
                 weight_decay=0,
                 amsgrad=False,
                 adamw_mode=True,
                 fp32_optimizer_states=True):
        """Fast vectorized implementation of two variations of Adam optimizer on CPU:

        * Adam: A Method for Stochastic Optimization: (https://arxiv.org/abs/1412.6980);
        * AdamW: Fixing Weight Decay Regularization in Adam (https://arxiv.org/abs/1711.05101)

        DeepSpeed CPU Adam(W) provides between 5x and 7x speedup over torch.optim.adam(W).
        In order to apply this optimizer, the model's master parameters (in FP32) must
        reside in CPU memory.

        To train on a heterogeneous system, such as coordinating CPU and GPU, DeepSpeed offers
        the ZeRO-Offload technology which efficiently offloads the optimizer states into CPU memory,
        with minimal impact on training throughput. DeepSpeedCPUAdam plays an important role to minimize
        the overhead of the optimizer's latency on CPU. Please refer to ZeRO-Offload tutorial
        (https://www.deepspeed.ai/tutorials/zero-offload/) for more information on how to enable this technology.

        For calling the step function, there are two options available: (1) update the optimizer's states, or (2) update
        the optimizer's states and copy the parameters back to GPU at the same time. We have seen that the second
        option can bring 30% higher throughput than doing the copy separately as in option one.


        .. note::
                We recommend using our `config
                <https://www.deepspeed.ai/docs/config-json/#optimizer-parameters>`_
                to allow :meth:`deepspeed.initialize` to build this optimizer
                for you.


        Arguments:
            model_params (iterable): iterable of parameters to optimize or dicts defining
                parameter groups.
            lr (float, optional): learning rate. (default: 1e-3)
            betas (Tuple[float, float], optional): coefficients used for computing
                running averages of gradient and its square. (default: (0.9, 0.999))
            eps (float, optional): term added to the denominator to improve
                numerical stability. (default: 1e-8)
            weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
            amsgrad (boolean, optional): whether to use the AMSGrad variant of this
                algorithm from the paper `On the Convergence of Adam and Beyond`_
                (default: False) NOT SUPPORTED in DeepSpeed CPUAdam!
            adamw_mode: select between Adam and AdamW implementations (default: AdamW)
            fp32_optimizer_states: creates momentum and variance in full precision regardless of
                        the precision of the parameters (default: True)
        """

        default_args = dict(lr=lr,
                            betas=betas,
                            eps=eps,
                            weight_decay=weight_decay,
                            bias_correction=bias_correction,
                            amsgrad=amsgrad)
        super(DeepSpeedCPUAdam, self).__init__(model_params, default_args)

        self.opt_id = DeepSpeedCPUAdam.optimizer_id
        DeepSpeedCPUAdam.optimizer_id = DeepSpeedCPUAdam.optimizer_id + 1
        self.adam_w_mode = adamw_mode
        self.fp32_optimizer_states = fp32_optimizer_states
        self.ds_opt_adam = CPUAdamBuilder().load()

        self.ds_opt_adam.create_adam(self.opt_id,
                                     lr,
                                     betas[0],
                                     betas[1],
                                     eps,
                                     weight_decay,
                                     adamw_mode,
                                     should_log_le("info"))

    def __del__(self):
        # need to destroy the C++ object explicitly to avoid a memory leak when deepspeed.initialize
        # is used multiple times in the same process (notebook or pytest worker)
        self.ds_opt_adam.destroy_adam(self.opt_id)

    def __setstate__(self, state):
        super(DeepSpeedCPUAdam, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('amsgrad', False)

    @torch.no_grad()
    def step(self, closure=None, fp16_param_groups=None):
        """Update the model parameters.

        .. note::
            This method will be called internally by ZeRO-Offload. DeepSpeed
            users should still use ``engine.step()`` as shown in the
            `Getting Started
            <https://www.deepspeed.ai/getting-started/#training>`_ guide.

        Args:
            closure (callable, optional): closure to compute the loss.
                Defaults to ``None``.
            fp16_param_groups: FP16 GPU parameters to update. Performing the
                copy here reduces communication time. Defaults to ``None``.

        Returns:
            loss: if ``closure`` is provided. Otherwise ``None``.
        """

        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        # intended device for step
        device = torch.device('cpu')

        # converting the fp16 params to a group of parameter
        if type(fp16_param_groups) is list:
            if type(fp16_param_groups[0]) is not list:
                fp16_param_groups = [fp16_param_groups]
        elif fp16_param_groups is not None:
            fp16_param_groups = [[fp16_param_groups]]

        for group_id, group in enumerate(self.param_groups):
            for param_id, p in enumerate(group['params']):

                if p.grad is None:
                    continue

                assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make " \
                        "sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config."

                state = self.state[p]
                # State initialization
                if len(state) == 0:
                    #print(f'group {group_id} param {param_id} = {p.numel()}')
                    state['step'] = 0

                    #use full precision by default unless self.fp32_optimizer_states is off
                    state_dtype = torch.float if self.fp32_optimizer_states else p.dtype

                    # gradient momentums
                    state['exp_avg'] = torch.zeros_like(p.data,
                                                        dtype=state_dtype,
                                                        device=device)
                    #memory_format=torch.preserve_format)
                    # gradient variances
                    state['exp_avg_sq'] = torch.zeros_like(p.data,
                                                           dtype=state_dtype,
                                                           device=device)
                    #memory_format=torch.preserve_format)

                state['step'] += 1
                beta1, beta2 = group['betas']

                def print_my(group_name, fp16_param_groups):
                    format_string = ' ('
                    for i, fgroup in enumerate(fp16_param_groups):
                        format_string += '\n'
                        format_string += 'Parameter Group {0}\n'.format(i)
                        for key in fgroup:
                            format_string += '{}: ,shape:{}\n'.format(key, key.shape)
                    format_string += ')'
                    print('{}:\n{}'.format(group_name,format_string),flush=True)
                
                if fp16_param_groups is not None:
                    print('group_id:{}, param_id:{}'.format(group_id, param_id), flush=True)
                    print('param_group size:', len(self.param_groups),len(group['params']))
                    print('fp16_param_group size:', len(fp16_param_groups),len(fp16_param_groups[0]))
                    format_string = ' (\n'
                    format_string += '{}: ,shape:{}\n'.format(p, p.shape)
                    format_string += ')'
                    print('{}:\n{}\n'.format("param_group",format_string),flush=True)
                    # print_my("fp16_param_groups", fp16_param_groups)
                    format_string = ' (\n'
                    format_string += '{}: ,shape:{}\n'.format(fp16_param_groups[group_id][param_id], fp16_param_groups[group_id][param_id].shape)
                    format_string += ')'
                    print('{}:\n{}\n'.format("fp16_param_group",format_string),flush=True)

                    self.ds_opt_adam.adam_update_copy(
                        self.opt_id,
                        state['step'],
                        group['lr'],
                        beta1,
                        beta2,
                        group['eps'],
                        group['weight_decay'],
                        group['bias_correction'],
                        p.data,
                        p.grad.data,
                        state['exp_avg'],
                        state['exp_avg_sq'],
                        fp16_param_groups[group_id][param_id].data)
                else:
                    self.ds_opt_adam.adam_update(self.opt_id,
                                                 state['step'],
                                                 group['lr'],
                                                 beta1,
                                                 beta2,
                                                 group['eps'],
                                                 group['weight_decay'],
                                                 group['bias_correction'],
                                                 p.data,
                                                 p.grad.data,
                                                 state['exp_avg'],
                                                 state['exp_avg_sq'])
        return loss

Here is the printed output (screenshot attached in the original issue).

Here is the error information:

Traceback (most recent call last):
  File "/home/yuanzheng/biobart/train.py", line 502, in <module>
    main()
  File "/home/yuanzheng/biobart/train.py", line 495, in main
    run(args, model, optimizer, start_epoch)
  File "/home/yuanzheng/biobart/train.py", line 446, in run
    train(args, index, model, optimizer, pretrain_dataset_provider)
  File "/home/yuanzheng/biobart/train.py", line 379, in train
    model.network.step()
  File "/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1893, in step

 14%|█▍        | 71/508 [01:30<09:19,  1.28s/it]
Traceback (most recent call last):
  File "/home/yuanzheng/biobart/train.py", line 502, in <module>
    self._take_model_step(lr_kwargs)
  File "/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1799, in _take_model_step
    main()
  File "/home/yuanzheng/biobart/train.py", line 495, in main
    self.optimizer.step()
  File "/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1720, in step
    run(args, model, optimizer, start_epoch)
  File "/home/yuanzheng/biobart/train.py", line 446, in run
    train(args, index, model, optimizer, pretrain_dataset_provider)
  File "/home/yuanzheng/biobart/train.py", line 379, in train
    self.optimizer.step(fp16_param_groups=bit16_param_groups)    model.network.step()
  File "/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1893, in step

  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    self._take_model_step(lr_kwargs)
  File "/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1799, in _take_model_step
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    self.optimizer.step()
  File "/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1720, in step
    return func(*args, **kwargs)
  File "/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 191, in step
    self.optimizer.step(fp16_param_groups=bit16_param_groups)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 191, in step
    format_string += '{}: ,shape:{}\n'.format(fp16_param_groups[group_id][param_id], fp16_param_groups[group_id][param_id].shape)
IndexError: list index out of range
    format_string += '{}: ,shape:{}\n'.format(fp16_param_groups[group_id][param_id], fp16_param_groups[group_id][param_id].shape)
IndexError: list index out of range
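
The IndexError indicates that fp16_param_groups, as received by DeepSpeedCPUAdam.step, contains fewer groups (or fewer parameters per group) than self.param_groups. As a minimal sketch (not part of DeepSpeed; names are taken from the snippet above), a check like the following at the start of step would surface the mismatch before any indexing:

# Hypothetical sanity check for the mismatch described in this issue. It would
# run after fp16_param_groups has been normalized into a list of lists; it only
# reports the misalignment, it does not fix it.
def check_group_alignment(param_groups, fp16_param_groups):
    assert len(fp16_param_groups) == len(param_groups), (
        f"group count mismatch: {len(fp16_param_groups)} fp16 groups vs "
        f"{len(param_groups)} fp32 groups")
    for gid, (group, fp16_group) in enumerate(zip(param_groups, fp16_param_groups)):
        assert len(fp16_group) == len(group['params']), (
            f"group {gid}: {len(fp16_group)} fp16 params vs "
            f"{len(group['params'])} fp32 params")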

To Reproduce

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 32,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-4,
            "betas": [0.9, 0.999],
            "eps": 1e-6,
            "weight_decay": 0.01
        }
    },

    "zero_optimization": {
        "stage": 2,
	"cpu_offload":true,
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": 4,
    "gradient_clipping": 1,
    "steps_per_print": 2000,
    "train_micro_batch_size_per_gpu": 40,
    "wall_clock_breakdown": false
}
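
Note that the assertion message in the quoted cpu_adam.py refers to 'offload_optimizer': 'cpu', the newer form of the ZeRO offload setting, while the config above uses the legacy boolean cpu_offload flag. As a small sketch (the helper function is hypothetical; the file name is taken from the launcher script below), either form can be detected like this:

import json

# Hypothetical helper: report whether a DeepSpeed ZeRO config requests CPU
# offload of optimizer state, via either the legacy boolean flag used above
# or the newer nested "offload_optimizer" block mentioned in the cpu_adam.py
# assertion message.
def requests_cpu_offload(config_path):
    with open(config_path) as f:
        zero = json.load(f).get("zero_optimization", {})
    legacy = bool(zero.get("cpu_offload", False))
    offload = zero.get("offload_optimizer") or {}
    modern = offload.get("device") == "cpu"
    return legacy or modern

print(requests_cpu_offload("./ds_config_zero2_queue.json"))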

Expected behavior

Training runs successfully when cpu_offload is set to true.

ds_report output

/home/yuanzheng/anaconda3/lib/python3.8/runpy.py:127: RuntimeWarning: 'deepspeed.env_report' found in sys.modules after import of package 'deepspeed', but prior to execution of 'deepspeed.env_report'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/yuanzheng/anaconda3/lib/python3.8/site-packages/torch']
torch version .................... 1.9.1+cu111
torch cuda version ............... 11.1
nvcc version .....................  [FAIL] cannot find CUDA_HOME via torch.utils.cpp_extension.CUDA_HOME=None 
deepspeed install path ........... ['/home/yuanzheng/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.10, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3


System info (please complete the following information):

  • OS: not specified
  • GPU count and types: not specified
  • Interconnects: not specified
  • Python version: 3.8

Launcher context

I use a Slurm environment and the torch launcher.

#!/bin/bash
#SBATCH --job-name=bart-dist
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1          # crucial - only 1 task per dist per node!
#SBATCH --gres=gpu:8                 # number of gpus
#SBATCH -o ../logs/%x-%j.log
#SBATCH -e ../logs/%x-%j.err

set -x -e
echo "START TIME: $(date)"
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=12006
GPUS_PER_NODE=8
NNODES=1
export TORCH_EXTENSIONS_DIR="/platform_tech/yuanzheng/tmp/torch_extensions"

export LAUNCHER="python -u -m torch.distributed.launch \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --max_restarts 0 \
    "
export CMD="
    /home/yuanzheng/biobart/train.py --config-file ./bart_queue.json \
    --output_dir /platform_tech/yuanzheng/biobart_ckps/biobart_test \
    --token_nosing_prob 0.1 \
    --max_seq_length 512 \
    --max_predictions_per_seq 150 \
    --seed 42 \
    --lr_schedule LL \
    --job_name biobart_pretrain \
    --print_steps 10 \
    --save_steps 1000 \
    --data_path_prefix /platform_tech/yuanzheng/pretrain_data/ \
    --deepspeed --deepspeed_config ./ds_config_zero2_queue.json
"

SINGULARITY_PATH=/platform_tech/yuanzheng/pytorch21_06_py3_docker_image_v2.sif
srun --jobid $SLURM_JOBID singularity exec --nv -B /platform_tech/yuanzheng/:/platform_tech/yuanzheng/,/home/yuanzheng/:/home/yuanzheng/ $SINGULARITY_PATH bash -c '$LAUNCHER --node_rank $SLURM_PROCID $CMD'

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

antoiloui commented on Feb 16, 2022 (1 reaction)

Hi @RezaYazdaniAminabadi, it works for me thanks a lot!

RezaYazdaniAminabadi commented on Feb 15, 2022 (1 reaction)

Hi @ganzhiruyi and @antoiloui

Just sent a PR to fix this. Can you please try this on your side? Thanks, Reza
