
BatchNorm RuntimeError: expected scalar type Half but found Float


Thanks for all your great work. I’ve found that trying to keep batch norm in fp32 results in a RuntimeError. Here is a minimal example:


import torch
import torch.optim as optim
from apex import amp

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# CustomLoss, MyModel, dataloaders, and args are defined elsewhere
criterion = CustomLoss(device=device)
model = MyModel().to(device)
dataloaders = ...
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()),
                       lr=args.lr, eps=1e-8)

# Wrap model and optimizer for mixed-precision training
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

dataiter = iter(dataloaders['train'])
images = next(dataiter)
images = images.to(device)
outputs = model(images)
loss = criterion(images, outputs)

# Scale the loss to avoid fp16 gradient underflow before backprop
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

Here is the error message:

python testing_apex.py 
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
torch.Size([2, 1, 320, 640]) torch.Size([2, 1, 320, 640])
Traceback (most recent call last):
  File "testing_apex.py", line 205, in <module>
    disparities,invalidations = model(left, right)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 355, in forward
    left_feats = self.feature_extractor(left)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 97, in forward
    x = self.preprocessor(x)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 33, in forward
    out = self.bn1(out)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: expected scalar type Half but found Float

To summarize the configurations I tried (see the calls sketched below):

  • opt_level='O1': raises the RuntimeError above.
  • opt_level='O3', keep_batchnorm_fp32=True: raises the same RuntimeError.
  • opt_level='O3', keep_batchnorm_fp32=False: runs without error, except that training produces nan losses, which is apparently to be expected from “pure” fp16 training.
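For reference, these three configurations correspond to amp.initialize calls along these lines (a sketch; model and optimizer are as in the example above):

# Raises RuntimeError: expected scalar type Half but found Float
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# Also raises the RuntimeError
model, optimizer = amp.initialize(model, optimizer,
                                  opt_level='O3', keep_batchnorm_fp32=True)

# Runs, but training losses become nan (expected for "pure" fp16)
model, optimizer = amp.initialize(model, optimizer,
                                  opt_level='O3', keep_batchnorm_fp32=False)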

Information about the system:

  • python --version: 3.7.3
  • nvcc --version: release 10.0, V10.0.130
  • torch.__version__: '1.1.0a0+95ce796'
  • Amp downloaded and installed today, May 13: commit 4ff153cd50e4533b21dc1fd97c0ed609e19c4042

Thanks for your help!

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments:11 (3 by maintainers)

Top GitHub Comments

3 reactions
ptrblck commented, May 14, 2019

@jbohnslav @mcarilli I could reproduce this error by disabling cuDNN (using torch.backends.cudnn.enabled = False).

@jbohnslav Could this be the issue? Are you accidentally disabling cuDNN somewhere?
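A minimal way to test this hypothesis (a sketch, not the model from the issue; the Conv2d/BatchNorm2d stack and the input shape are illustrative placeholders):

import torch
from apex import amp

# Disabling cuDNN reproduces the reported failure
torch.backends.cudnn.enabled = False

model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, kernel_size=3, padding=1),
    torch.nn.BatchNorm2d(8),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

x = torch.randn(2, 1, 320, 640, device='cuda')
out = model(x)  # RuntimeError: expected scalar type Half but found Float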

1 reaction
jbohnslav commented, May 14, 2019

@mcarilli,

Sure, I ran a few quick tests. My loss function is rather complex, with many components, so I’m reporting training and inference speed separately. Training includes the forward pass, loss computation, backward pass, and optimizer step; inference is the forward pass only. Note that with opt_level='O3', keep_batchnorm_fp32=True, losses became nan.

Experiment                                                    Train speed (fps)   Inference speed (fps)
fp32, cudnn=False                                             13.19               25.09
fp32, cudnn=True                                              22.81               50.87
fp16 (opt_level 'O1'), cudnn=True                             10.30               63.91
fp16 (opt_level 'O2'), cudnn=True                             10.14               64.32
fp16 (opt_level 'O3', keep_batchnorm_fp32=True), cudnn=True   23.96               65.02

I’m using a Titan RTX GPU. It seems most of the speedup came simply from re-enabling cuDNN. Half precision gives a ~20% speed increase during inference, but a ~50% slowdown during training. This isn’t the ultimate speedup I’d hoped for from Tensor Cores, but that’s outside the scope of this issue.
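One way to double-check the cuDNN state and time each configuration consistently (a sketch; the step callable and iteration count are placeholders):

import time
import torch

# Confirm cuDNN is available and enabled before benchmarking
assert torch.backends.cudnn.is_available() and torch.backends.cudnn.enabled

def fps(step, n_iters=100):
    # step() should run one full training or inference step
    torch.cuda.synchronize()   # flush pending CUDA work before timing
    start = time.time()
    for _ in range(n_iters):
        step()
    torch.cuda.synchronize()   # wait for asynchronous kernels to finish
    return n_iters / (time.time() - start)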
