
Scalar type error when enabling FP16 (`RuntimeError: Expected object of scalar type Half...`)


When using DeepSpeed 0.3 / 0.3.2 with FP16 enabled, I get a scalar type error: `RuntimeError: Expected object of scalar type Half but got scalar type Float for argument #2 'mat1' in call to _th_addmm`

In my previous projects using DeepSpeed 0.1 / 0.2 I never ran into this error. Does anyone have any ideas?

Thanks in advance!


Below is the demo script:

import os
import argparse
import socket
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset
import deepspeed


def parse_args():
    parser = argparse.ArgumentParser(description='DeepSpeed ZeRO demo')
    parser.add_argument("--local_rank", type=int)
    parser = deepspeed.add_config_arguments(parser)
    return parser.parse_args()


def gen_data():
    # Synthetic dataset: both inputs and targets are created as fp32 tensors.
    inps = torch.arange(1024 * 10, dtype=torch.float32).view(1024, 10)
    tgts = torch.arange(1024 * 5, dtype=torch.float32).view(1024, 5)
    return TensorDataset(inps, tgts)


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        x = self.net1(x)
        return self.net2(self.relu(x))


def run(model_engine, train_loader):
    loss_fn = nn.MSELoss()
    model_engine.train()
    for i, batch in enumerate(train_loader):
        # Move the fp32 batch to this rank's GPU (no cast to fp16 happens here).
        x = batch[0].to(model_engine.local_rank)
        y = batch[1].to(model_engine.local_rank)
        outputs = model_engine(x)
        model_engine.backward(loss_fn(outputs, y))
        model_engine.step()


def demo_zero(config):
    print(f'Running ZeRO example on local_rank {config.local_rank}.')
    model = ToyModel()
    dataset = gen_data()
    model_engine, optimizer, train_loader, lr_scheduler = deepspeed.initialize(
        args=config, model=model, model_parameters=model.parameters(),
        training_data=dataset)
    run(model_engine, train_loader)
    print(f'Local_rank {config.local_rank} job finished')


if __name__ == "__main__":
    args = parse_args()
    demo_zero(args)

DeepSpeed configuration:

{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015
    }
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients" : false,
    "cpu_offload": false
    }
}

`ds_report` output:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
/bin/sh: line 0: type: llvm-config-9: not found
 [WARNING]  sparse_attn requires a torch version >= 1.5 but detected 1.4
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/torch']
torch version .................... 1.4.0
torch cuda version ............... 10.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.3.2, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.4, cuda 10.1

The error message

Traceback (most recent call last):
  File "/*/202003_ZERO_Test/ddl_introduction/ds_demo_zero.py", line 73, in <module>
    demo_zero(args)
  File "/*/202003_ZERO_Test/ddl_introduction/ds_demo_zero.py", line 67, in demo_zero
    run(model_engine, train_loader)
  File "/*/202003_ZERO_Test/ddl_introduction/ds_demo_zero.py", line 53, in run
    outputs = model_engine(x)
  File "/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 796, in forward
    loss = self.module(*inputs, **kwargs)
  File "/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/*/202003_ZERO_Test/ddl_introduction/ds_demo_zero.py", line 43, in forward
    x = self.net1(x)
  File "/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/*/packages/conda/miniconda3/envs/torch1p4/lib/python3.6/site-packages/torch/nn/functional.py", line 1370, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: Expected object of scalar type Half but got scalar type Float for argument #2 'mat1' in call to _th_addmm


Top GitHub Comments

tjruwase commented on Dec 2, 2020

@Yicheng-G, auto-casting can be tricky since it can expose the user to silent precision issues. We have generally assumed that when fp16 or mixed-precision training is enabled, users will take care to prepare fp16 inputs for the forward pass. But thanks for the feedback; we will review our tutorials to see how to provide clarification and avoid confusion.
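
For the demo script above, a minimal sketch of "preparing fp16 inputs" is to cast each batch to half precision inside run() before the forward pass. The fp16_enabled() query on the engine and the choice to also cast the targets are assumptions for illustration, not an official recipe:

def run(model_engine, train_loader):
    loss_fn = nn.MSELoss()
    model_engine.train()
    for i, batch in enumerate(train_loader):
        x = batch[0].to(model_engine.local_rank)
        y = batch[1].to(model_engine.local_rank)
        # The engine's weights are half precision when fp16 is enabled,
        # so cast the fp32 batch to match before the forward pass.
        if model_engine.fp16_enabled():
            x = x.half()
            y = y.half()
        outputs = model_engine(x)
        model_engine.backward(loss_fn(outputs, y))
        model_engine.step()

With the inputs matching the weight dtype, the addmm inside nn.Linear no longer sees mixed Half/Float operands.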

rohitdwivedula commented on Feb 26, 2022

Looks like a pretty old thread, but I thought I'd share the stack trace in case someone else chances upon this thread or has a similar problem later. Using DeepSpeed v0.5.10:

File "train.py", line 173, in train
  student_model.backward(loss)
File "/home/gandiva/rohitd/.dvenv/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 1697, in backward
  self.optimizer.backward(loss)
File "/home/gandiva/rohitd/.dvenv/lib/python3.6/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1910, in backward
  self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/gandiva/rohitd/.dvenv/lib/python3.6/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
  scaled_loss.backward(retain_graph=retain_graph)
File "/home/gandiva/rohitd/.dvenv/lib/python3.6/site-packages/torch/tensor.py", line 245, in backward
  torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/gandiva/rohitd/.dvenv/lib/python3.6/site-packages/torch/autograd/__init__.py", line 147, in backward
  allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Found dtype Float but expected Half

This was happening because one of my loss calculations looked something like this:

mse_loss_fct = torch.nn.MSELoss(reduction="mean")
loss_4 = mse_loss_fct(a, b)

The data types of all of these variables after these two steps were:

  • a.dtype -> torch.float16
  • b.dtype -> torch.float32
  • loss_4.dtype -> torch.float16

The RuntimeError was happening when backward() was being called. To fix it, convert both a and b to either torch.float16 or torch.float32 (doing the MSE loss calculation in FP32 might be a good idea, since FP16 does sometimes lead to NaNs).
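
A minimal sketch of that fix, with placeholder tensors standing in for the real a and b (the shapes and values here are made up for illustration):

import torch

mse_loss_fct = torch.nn.MSELoss(reduction="mean")

# Placeholders: `a` plays the role of the fp16 model output, `b` the fp32 target.
a = torch.randn(8, 5).half()
b = torch.randn(8, 5)

# Cast both operands to a common dtype before the loss; computing the MSE
# in fp32 is the numerically safer of the two options mentioned above.
loss_4 = mse_loss_fct(a.float(), b)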
