Scalar Type Error when enabling FP16 (`RuntimeError: Expected object of scalar type Half...`)
When using DeepSpeed 0.3 / 0.3.2 with FP16 enabled, I get a scalar type error:
RuntimeError: Expected object of scalar type Half but got scalar type Float for argument #2 'mat1' in call to _th_addmm
In my previous projects with DeepSpeed 0.1 / 0.2 I never ran into this error. Does anyone have any ideas?
Thanks in advance!
The following is the demo script:
import os
import argparse
import socket
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset
import deepspeed


def parse_args():
    parser = argparse.ArgumentParser(description='DeepSpeed ZeRO demo')
    parser.add_argument("--local_rank", type=int)
    parser = deepspeed.add_config_arguments(parser)
    return parser.parse_args()


def gen_data():
    inps = torch.arange(1024 * 10, dtype=torch.float32).view(1024, 10)
    tgts = torch.arange(1024 * 5, dtype=torch.float32).view(1024, 5)
    return TensorDataset(inps, tgts)


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        x = self.net1(x)
        return self.net2(self.relu(x))


def run(model_engine, train_loader):
    loss_fn = nn.MSELoss()
    model_engine.train()
    for i, batch in enumerate(train_loader):
        x = batch[0].to(model_engine.local_rank)
        y = batch[1].to(model_engine.local_rank)
        outputs = model_engine(x)
        model_engine.backward(loss_fn(outputs, y))
        model_engine.step()


def demo_zero(config):
    print(f'Running ZeRO example on local_rank {config.local_rank}.')
    model = ToyModel()
    dataset = gen_data()
    model_engine, optimizer, train_loader, lr_scheduler = deepspeed.initialize(
        args=config, model=model, model_parameters=model.parameters(),
        training_data=dataset)
    run(model_engine, train_loader)
    print(f'Local_rank {config.local_rank} job finished')


if __name__ == "__main__":
    args = parse_args()
    demo_zero(args)
DeepSpeed configuration:
{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015
    }
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": false,
    "cpu_offload": false
  }
}
ds_report output:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
/bin/sh: line 0: type: llvm-config-9: not found
[WARNING] sparse_attn requires a torch version >= 1.5 but detected 1.4
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/torch']
torch version .................... 1.4.0
torch cuda version ............... 10.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.3.2, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.4, cuda 10.1
The error message:
Traceback (most recent call last):
File "/*/202003_ZERO_Test/ddl_introduction/ds_demo_zero.py", line 73, in <module>
demo_zero(args)
File "/*/202003_ZERO_Test/ddl_introduction/ds_demo_zero.py", line 67, in demo_zero
run(model_engine, train_loader)
File "/*/202003_ZERO_Test/ddl_introduction/ds_demo_zero.py", line 53, in run
outputs = model_engine(x)
File "/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 796, in forward
loss = self.module(*inputs, **kwargs)
File "/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/*/202003_ZERO_Test/ddl_introduction/ds_demo_zero.py", line 43, in forward
x = self.net1(x)
File "/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/torch/nn/functional.py", line 1370, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: Expected object of scalar type Half but got scalar type Float for argument #2 'mat1' in call to _th_addmm
Top GitHub Comments
@Yicheng-G, auto-casting can be tricky since it can expose the user to silent precision issues. We have generally assumed that when fp16 or mixed-precision training is enabled, users will take care to prepare fp16 inputs for the forward pass. But thanks for the feedback; we will review our tutorials to clarify this and avoid confusion.
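For the repro script above, that points at a user-side fix: cast the inputs to half before calling the model engine. A minimal sketch of the adjusted training loop, assuming FP16 is enabled in the config (the `.half()` casts are the only change; if the same script must also run without FP16, the casts can be made conditional on the config):

```python
def run(model_engine, train_loader):
    loss_fn = nn.MSELoss()
    model_engine.train()
    for i, batch in enumerate(train_loader):
        # With fp16 enabled, DeepSpeed holds the model weights in FP16,
        # so the inputs (and the targets used by the loss) must match.
        x = batch[0].to(model_engine.local_rank).half()
        y = batch[1].to(model_engine.local_rank).half()
        outputs = model_engine(x)
        model_engine.backward(loss_fn(outputs, y))
        model_engine.step()
```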
Looks like a pretty old thread, but thought I'd share the stack trace in case someone else chances upon this thread or has a similar problem later. Using DeepSpeed v0.5.10:
This was happening because one of my loss calculations looked something like this:
The data types of all of these variables after these two steps were:
a.dtype -> torch.float16
b.dtype -> torch.float32
loss.dtype -> torch.float16
The RuntimeError was happening when backward() was being called. To fix it, convert both a and b to either torch.float16 or torch.float32 (doing MSE loss calculations in FP32 might be a good idea, since FP16 does sometimes lead to NaNs).
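A short sketch of the FP32 variant of that fix; the tensors a and b and their dtypes are taken from the comment above, and the helper name stable_mse is just for illustration:

```python
import torch
import torch.nn.functional as F

def stable_mse(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Upcast both operands so the loss is computed in FP32. This avoids the
    # Half/Float mismatch on backward() and is less prone to FP16 overflow/NaNs.
    return F.mse_loss(a.float(), b.float())

# Dtypes as in the comment above:
a = torch.randn(4, 5, dtype=torch.float16)
b = torch.randn(4, 5, dtype=torch.float32)
loss = stable_mse(a, b)  # loss.dtype == torch.float32
```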