[BUG] DeepSpeedCPUAdam wrong result after calling step
Hi there,
Recently we have been using DeepSpeed to speed up our model training, but we often hit an unexpected CUDA "invalid resource handle" error once CPU offload was enabled. After some effort we narrowed the problem down to DeepSpeedCPUAdam's step method, so we adapted the code here into the following:
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam
import time

device = 'cpu'
model_size = 1 * 1024**2

# fp32 master weights on CPU and matching fp16 copies on GPU
param1 = torch.nn.Parameter(torch.ones(model_size, device=device))
param_fp16_1 = torch.nn.Parameter(torch.ones(model_size,
                                             dtype=torch.half,
                                             device='cuda:0'))
param = torch.nn.Parameter(torch.ones(model_size, device=device))
param_fp16 = torch.nn.Parameter(torch.ones(model_size,
                                           dtype=torch.half,
                                           device='cuda:0'))

optimizer = DeepSpeedCPUAdam([param1, param], lr=0.01)
# torch.set_num_threads(128)
param1.grad = torch.ones(model_size, device=device)
param.grad = torch.ones(model_size, device=device)

avg = 0
for i in range(2):
    start = time.time()
    # step() should also copy the updated fp32 params into the given fp16 tensors
    optimizer.step(fp16_param_groups=[param_fp16_1, param_fp16])
    stop = time.time()
    avg += (stop - start)
    param1.grad = torch.ones(model_size, device=device) * 2
    param.grad = torch.ones(model_size, device=device) * 2
    print('cpu param1: ', param1)
    print('gpu param_fp16_1: ', param_fp16_1.float())
    print('cpu param: ', param)
    print('gpu param_fp16: ', param_fp16.float())
print("Elapsed Time is ", avg / 100)
We wanted to see whether the step method functions as expected in the latest release, but we got the following results:
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.6612434387207031 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.010000, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
cpu param1: Parameter containing:
tensor([0.9900, 0.9900, 0.9900, ..., 0.9900, 0.9900, 0.9900],
requires_grad=True)
gpu param_fp16_1: tensor([0.9902, 0.9902, 0.9902, ..., 0.9902, 0.9902, 0.9902], device='cuda:0',
grad_fn=<ToCopyBackward0>)
cpu param: Parameter containing:
tensor([0.9900, 0.9900, 0.9900, ..., 0.9900, 0.9900, 0.9900],
requires_grad=True)
gpu param_fp16: tensor([0.9902, 1.0000, 1.0000, ..., 1.0000, 1.0000, 1.0000], device='cuda:0',
grad_fn=<ToCopyBackward0>)
cpu param1: Parameter containing:
tensor([0.9803, 0.9803, 0.9803, ..., 0.9803, 0.9803, 0.9803],
requires_grad=True)
gpu param_fp16_1: tensor([0.9805, 0.9805, 0.9805, ..., 0.9805, 0.9805, 0.9805], device='cuda:0',
grad_fn=<ToCopyBackward0>)
cpu param: Parameter containing:
tensor([0.9803, 0.9803, 0.9803, ..., 0.9803, 0.9803, 0.9803],
requires_grad=True)
gpu param_fp16: tensor([0.9805, 1.0000, 1.0000, ..., 1.0000, 1.0000, 1.0000], device='cuda:0',
grad_fn=<ToCopyBackward0>)
Elapsed Time is 0.0007363009452819824
Apparently param_fp16 on the GPU is not updated correctly, unlike param_fp16_1: only its first element is updated. The params on the CPU side look good to me, so I suspect something is wrong with https://github.com/microsoft/DeepSpeed/blob/8bbf081ad83fbc75f85087856e34d2344577edae/csrc/adam/cpu_adam.cpp#L235.
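For reference, a minimal sanity check (a sketch we put together for this report, not part of the original script) comparing a single DeepSpeedCPUAdam step against torch.optim.Adam on identical fp32 CPU parameters also suggests the CPU-side math is fine:

import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

# Sanity-check sketch (hypothetical, not from the original report): one step of
# DeepSpeedCPUAdam vs. torch.optim.Adam on identical fp32 CPU parameters.
# (DeepSpeedCPUAdam defaults to AdamW mode, but with weight_decay=0 it matches Adam.)
n = 1024
p_ds = torch.nn.Parameter(torch.ones(n))
p_ref = torch.nn.Parameter(torch.ones(n))
opt_ds = DeepSpeedCPUAdam([p_ds], lr=0.01, weight_decay=0.0)
opt_ref = torch.optim.Adam([p_ref], lr=0.01, weight_decay=0.0)
p_ds.grad = torch.ones(n)
p_ref.grad = torch.ones(n)
opt_ds.step()
opt_ref.step()
# With grad=1 and lr=0.01, one Adam step moves 1.0 to roughly 0.99, which
# matches the CPU-side values printed above.
print(torch.allclose(p_ds, p_ref, atol=1e-4))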
Could you help explain this strange behavior, or correct us if we are using the API incorrectly? A quick fix would also be appreciated if this is indeed a bug.
Anyway, looking forward to your attention, thanks. @jeffra
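In the meantime, the workaround we are using (a sketch, assuming the setup from the script above) is to call step() without fp16_param_groups and mirror the updated fp32 master weights into the fp16 copies ourselves:

# Workaround sketch: skip the fused fp16 copy inside DeepSpeedCPUAdam.step()
# and copy the updated fp32 master weights into the fp16 tensors manually.
optimizer.step()
with torch.no_grad():
    for p32, p16 in zip([param1, param], [param_fp16_1, param_fp16]):
        p16.copy_(p32)  # copy_ handles the dtype and device conversion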
Top GitHub Comments
@zgsbughammer Thanks for the good suggestion. I have added a PR to address this. Please try it and see if it works on your side. For the other part of the issue, I look forward to a reproducible test so we can fix it. Thanks, Reza
@tjruwase Thanks for your attention, will close.