[BUG] DeepSpeedCPUAdam wrong result after calling step
Hi there,
Recently we have been using DeepSpeed to speed up our model training, but we often hit an unexpected CUDA "invalid resource handle" error once CPU offload was enabled. After some effort we narrowed the problem down to DeepSpeedCPUAdam's step method, so we adapted the code here into the following:
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam
import time

device = 'cpu'
model_size = 1 * 1024**2

# fp32 master weights on CPU and matching fp16 copies on GPU
param1 = torch.nn.Parameter(torch.ones(model_size, device=device))
param_fp16_1 = torch.nn.Parameter(torch.ones(model_size,
                                             dtype=torch.half,
                                             device='cuda:0'))
param = torch.nn.Parameter(torch.ones(model_size, device=device))
param_fp16 = torch.nn.Parameter(torch.ones(model_size,
                                           dtype=torch.half,
                                           device='cuda:0'))

optimizer = DeepSpeedCPUAdam([param1, param], lr=0.01)
# torch.set_num_threads(128)
param1.grad = torch.ones(model_size, device=device)
param.grad = torch.ones(model_size, device=device)

avg = 0
for i in range(2):
    start = time.time()
    # step() should also copy the updated fp32 params into the given fp16 tensors
    optimizer.step(fp16_param_groups=[param_fp16_1, param_fp16])
    stop = time.time()
    avg += (stop - start)
    param1.grad = torch.ones(model_size, device=device) * 2
    param.grad = torch.ones(model_size, device=device) * 2
    print('cpu param1: ', param1)
    print('gpu param_fp16_1: ', param_fp16_1.float())
    print('cpu param: ', param)
    print('gpu param_fp16: ', param_fp16.float())
print("Elapsed Time is ", avg / 100)
We wanted to see whether the step method functions as expected in the latest release, but we got the following results:
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.6612434387207031 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.010000, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
cpu param1: Parameter containing:
tensor([0.9900, 0.9900, 0.9900, ..., 0.9900, 0.9900, 0.9900],
requires_grad=True)
gpu param_fp16_1: tensor([0.9902, 0.9902, 0.9902, ..., 0.9902, 0.9902, 0.9902], device='cuda:0',
grad_fn=<ToCopyBackward0>)
cpu param: Parameter containing:
tensor([0.9900, 0.9900, 0.9900, ..., 0.9900, 0.9900, 0.9900],
requires_grad=True)
gpu param_fp16: tensor([0.9902, 1.0000, 1.0000, ..., 1.0000, 1.0000, 1.0000], device='cuda:0',
grad_fn=<ToCopyBackward0>)
cpu param1: Parameter containing:
tensor([0.9803, 0.9803, 0.9803, ..., 0.9803, 0.9803, 0.9803],
requires_grad=True)
gpu param_fp16_1: tensor([0.9805, 0.9805, 0.9805, ..., 0.9805, 0.9805, 0.9805], device='cuda:0',
grad_fn=<ToCopyBackward0>)
cpu param: Parameter containing:
tensor([0.9803, 0.9803, 0.9803, ..., 0.9803, 0.9803, 0.9803],
requires_grad=True)
gpu param_fp16: tensor([0.9805, 1.0000, 1.0000, ..., 1.0000, 1.0000, 1.0000], device='cuda:0',
grad_fn=<ToCopyBackward0>)
Elapsed Time is 0.0007363009452819824
Apparently param_fp16 on the GPU is not updated correctly, unlike param_fp16_1: only its first element is updated. The params on the CPU side look good to me, so I suspect something is wrong with https://github.com/microsoft/DeepSpeed/blob/8bbf081ad83fbc75f85087856e34d2344577edae/csrc/adam/cpu_adam.cpp#L235.
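For reference, a minimal sanity check (a sketch we put together for this report, not part of the original script) comparing a single DeepSpeedCPUAdam step against torch.optim.Adam on identical fp32 CPU parameters also suggests the CPU-side math is fine:

import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

# Sanity-check sketch (hypothetical, not from the original report): one step of
# DeepSpeedCPUAdam vs. torch.optim.Adam on identical fp32 CPU parameters.
# (DeepSpeedCPUAdam defaults to AdamW mode, but with weight_decay=0 it matches Adam.)
n = 1024
p_ds = torch.nn.Parameter(torch.ones(n))
p_ref = torch.nn.Parameter(torch.ones(n))
opt_ds = DeepSpeedCPUAdam([p_ds], lr=0.01, weight_decay=0.0)
opt_ref = torch.optim.Adam([p_ref], lr=0.01, weight_decay=0.0)
p_ds.grad = torch.ones(n)
p_ref.grad = torch.ones(n)
opt_ds.step()
opt_ref.step()
# With grad=1 and lr=0.01, one Adam step moves 1.0 to roughly 0.99, which
# matches the CPU-side values printed above.
print(torch.allclose(p_ds, p_ref, atol=1e-4))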
Could you help explain this strange behavior, or correct us if we are using the API incorrectly? A quick fix would also be appreciated if this is indeed a bug.
Anyway, looking forward to your attention, thanks. @jeffra
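In the meantime, the workaround we are using (a sketch, assuming the setup from the script above) is to call step() without fp16_param_groups and mirror the updated fp32 master weights into the fp16 copies ourselves:

# Workaround sketch: skip the fused fp16 copy inside DeepSpeedCPUAdam.step()
# and copy the updated fp32 master weights into the fp16 tensors manually.
optimizer.step()
with torch.no_grad():
    for p32, p16 in zip([param1, param], [param_fp16_1, param_fp16]):
        p16.copy_(p32)  # copy_ handles the dtype and device conversion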
Top GitHub Comments
@zgsbughammer Thanks for the good suggestion. I have added a PR to address this. Please try it and see if it works on your side. For the other part of the issue, I look forward to a reproducible test so we can fix it. Thanks, Reza
@tjruwase Thanks for your attention, will close.