
BatchNorm RuntimeError: expected scalar type Half but found Float


Thanks for all your great work. I’ve found that trying to keep batch norm in fp32 results in a RuntimeError. Here is a minimal example:


import torch
import torch.optim as optim
from apex import amp

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# CustomLoss, MyModel, dataloaders, and args are defined elsewhere
criterion = CustomLoss(device=device)
model = MyModel().to(device)
dataloaders = ...
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()),
                       lr=args.lr, eps=1e-8)

# Wrap model and optimizer for mixed-precision training
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

dataiter = iter(dataloaders['train'])
images = next(dataiter)
images = images.to(device)
outputs = model(images)
loss = criterion(images, outputs)

# Scale the loss to avoid fp16 gradient underflow before backprop
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

Here is the error message:

python testing_apex.py 
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
torch.Size([2, 1, 320, 640]) torch.Size([2, 1, 320, 640])
Traceback (most recent call last):
  File "testing_apex.py", line 205, in <module>
    disparities,invalidations = model(left, right)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 355, in forward
    left_feats = self.feature_extractor(left)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 97, in forward
    x = self.preprocessor(x)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 33, in forward
    out = self.bn1(out)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: expected scalar type Half but found Float

To summarize the configurations I tried (see the calls sketched below):

  • opt_level='O1': raises the RuntimeError above.
  • opt_level='O3', keep_batchnorm_fp32=True: raises the same RuntimeError.
  • opt_level='O3', keep_batchnorm_fp32=False: runs without error, except that training produces nan losses, which is apparently to be expected from “pure” fp16 training.
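For reference, these three configurations correspond to amp.initialize calls along these lines (a sketch; model and optimizer are as in the example above):

# Raises RuntimeError: expected scalar type Half but found Float
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# Also raises the RuntimeError
model, optimizer = amp.initialize(model, optimizer,
                                  opt_level='O3', keep_batchnorm_fp32=True)

# Runs, but training losses become nan (expected for "pure" fp16)
model, optimizer = amp.initialize(model, optimizer,
                                  opt_level='O3', keep_batchnorm_fp32=False)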

Information about the system:

  • python --version: 3.7.3
  • nvcc --version: release 10.0, V10.0.130
  • torch.__version__: '1.1.0a0+95ce796'
  • Amp downloaded and installed today, May 13: commit 4ff153cd50e4533b21dc1fd97c0ed609e19c4042

Thanks for your help!

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments:11 (3 by maintainers)

Top GitHub Comments

3 reactions
ptrblck commented, May 14, 2019

@jbohnslav @mcarilli I could reproduce this error by disabling cuDNN (using torch.backends.cudnn.enabled = False).

@jbohnslav Could this be the issue? Are you accidentally disabling cuDNN somewhere?
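A minimal way to test this hypothesis (a sketch, not the model from the issue; the Conv2d/BatchNorm2d stack and the input shape are illustrative placeholders):

import torch
from apex import amp

# Disabling cuDNN reproduces the reported failure
torch.backends.cudnn.enabled = False

model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, kernel_size=3, padding=1),
    torch.nn.BatchNorm2d(8),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

x = torch.randn(2, 1, 320, 640, device='cuda')
out = model(x)  # RuntimeError: expected scalar type Half but found Float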

1 reaction
jbohnslav commented, May 14, 2019

@mcarilli,

Sure, I ran a few quick tests. My loss function is rather complex, with many components, so I’m reporting training and inference speed separately. Training includes the forward pass, loss computation, backward pass, and optimizer step; inference is the forward pass only. Note that with opt_level='O3', keep_batchnorm_fp32=True, losses became nan.

Experiment                                                    Train speed (fps)   Inference speed (fps)
fp32, cudnn=False                                             13.19               25.09
fp32, cudnn=True                                              22.81               50.87
fp16 (opt_level 'O1'), cudnn=True                             10.30               63.91
fp16 (opt_level 'O2'), cudnn=True                             10.14               64.32
fp16 (opt_level 'O3', keep_batchnorm_fp32=True), cudnn=True   23.96               65.02

I’m using a Titan RTX GPU. It seems most of the speedup came simply from re-enabling cuDNN. Half precision gives a ~20% speed increase during inference, but a ~50% slowdown during training. This isn’t the ultimate speedup I’d hoped for from Tensor Cores, but that’s outside the scope of this issue.
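One way to double-check the cuDNN state and time each configuration consistently (a sketch; the step callable and iteration count are placeholders):

import time
import torch

# Confirm cuDNN is available and enabled before benchmarking
assert torch.backends.cudnn.is_available() and torch.backends.cudnn.enabled

def fps(step, n_iters=100):
    # step() should run one full training or inference step
    torch.cuda.synchronize()   # flush pending CUDA work before timing
    start = time.time()
    for _ in range(n_iters):
        step()
    torch.cuda.synchronize()   # wait for asynchronous kernels to finish
    return n_iters / (time.time() - start)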
