
DeepSpeedCPUAdam triggers core dump at a specific shape (1024,) using FP16

See original GitHub issue

Describe the bug

I pass a tensor with shape (1024,) and dtype float16 to initialize DeepSpeedCPUAdam, but when I call step(), the program aborts with a core dump.

However, when the tensor's shape is (1024, 1024) or its dtype is float32, the program works perfectly fine.
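Since the failure is an abort inside native code, capturing the core file and reading its backtrace is often the fastest way to localize it. The core-file size limit can be raised from Python itself with the stdlib `resource` module before running the repro (a Linux-specific sketch; the gdb invocation in the comment is illustrative, not part of the original report):

```python
import resource

# Raise the core-file size limit to its hard maximum so that an abort()
# inside a native extension (e.g. the cpu_adam op) actually writes a core dump.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
print("core file limit (soft, hard):", resource.getrlimit(resource.RLIMIT_CORE))

# After the crash, the backtrace can be inspected with gdb, e.g.:
#   gdb "$(which python)" core -ex bt -ex quit
```

Alternatively, `ulimit -c unlimited` in the shell before launching Python has the same effect for the whole session.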

To Reproduce

Just run the code below:

import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

N = 1024
device = torch.device("cpu")

# A 1-D fp16 parameter of shape (1024,) triggers the abort;
# shape (1024, 1024) or dtype float32 works fine.
tmp = torch.randn(N, device=device).half()
param_bias = torch.nn.Parameter(tmp)
param = [param_bias]
optimizer = DeepSpeedCPUAdam(param)
param_bias.grad = torch.randn(N, device=device).half()
optimizer.step()  # aborts with a core dump
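Independent of the fix, a common pattern for making fp16 training robust is to keep an fp32 "master" copy of the parameters, take the optimizer step in full precision, and cast back to fp16 only for storage. A minimal stdlib-only sketch of that pattern, emulating fp16 storage via `struct`'s IEEE half-precision `'e'` format (the `sgd_step` helper and its values are illustrative, not DeepSpeed's API):

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip through IEEE half precision ('e') to emulate fp16 storage.
    return struct.unpack('e', struct.pack('e', x))[0]

def sgd_step(master, grads, lr=0.1):
    # Take the step in full fp32 precision on the master copy...
    master = [w - lr * g for w, g in zip(master, grads)]
    # ...then derive fp16 storage values from the updated master weights.
    fp16_weights = [to_fp16(w) for w in master]
    return master, fp16_weights

master = [1.0, 2.0, 3.0]
grads = [0.5, 0.5, 0.5]
master, weights = sgd_step(master, grads)
print(weights)  # fp16-rounded copies of the fp32 master weights
```

The master copy accumulates small updates that fp16 would round away; only the rounded snapshot is exposed to fp16 consumers.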

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['xxx']
torch version .................... 1.10.0+cu111
torch cuda version ............... 11.1
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... ['xxx']
deepspeed info ................... 0.6.6+3da84185, 3da84185, master
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.1

System info (please complete the following information):

  • OS: Ubuntu 18.04
  • CPU: Intel Xeon E5-2620 v4 @ 16x 3GHz
  • Python version: 3.8

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
zzqq2199 commented, Jun 22, 2022

@RezaYazdaniAminabadi Thank you very much! After testing, CPU-Adam now works very well, and I have been able to train GPT/BERT models using fp16. I can see from the commit log that you fixed the bug very quickly. Great work!

0 reactions
RezaYazdaniAminabadi commented, Jun 20, 2022

Hi @zzqq2199

Sorry for the slow reply here. Could you please try this PR to see if it fixes the issue? Thanks, Reza
