BatchNorm RuntimeError: expected scalar type Half but found Float
Thanks for all your great work. I’ve found that trying to keep batch norm in fp32 results in a RuntimeError. Here is a minimal example:
import torch
import torch.optim as optim
from apex import amp

# CustomLoss, MyModel, dataloaders, and args are defined elsewhere in my code
device = torch.device("cuda:%d" % (0) if torch.cuda.is_available() else "cpu")
criterion = CustomLoss(device=device)
model = MyModel().to(device)
dataloaders = ...
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=args.lr, eps=1e-8)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

dataiter = iter(dataloaders['train'])
images = next(dataiter)
images = images.to(device)
outputs = model(images)
loss = criterion(images, outputs)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
Here is the error message:
python testing_apex.py
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
torch.Size([2, 1, 320, 640]) torch.Size([2, 1, 320, 640])
Traceback (most recent call last):
File "testing_apex.py", line 205, in <module>
disparities,invalidations = model(left, right)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 355, in forward
left_feats = self.feature_extractor(left)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 97, in forward
x = self.preprocessor(x)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/jim/Documents/python/stereo_model_zoo/models/activestereonet.py", line 33, in forward
out = self.bn1(out)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
exponential_average_factor, self.eps)
File "/home/jim/anaconda3/envs/pt3/lib/python3.7/site-packages/torch/nn/functional.py", line 1697, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: expected scalar type Half but found Float
If I use opt_level='O1', I get the error. If I use opt_level='O3', keep_batchnorm_fp32=True, I get the error. If I use opt_level='O3', keep_batchnorm_fp32=False, everything works fine (except that training results in nan losses, which is apparently to be expected from ‘pure’ fp16 training). A sketch of the three configurations I tried follows.
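For clarity, here is a sketch of the three amp.initialize configurations described above; model and optimizer are the same objects as in the minimal example, and each configuration was tried in a separate run (amp.initialize is meant to be called only once per training script):

# Configuration 1: errors with "expected scalar type Half but found Float"
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# Configuration 2: also errors
model, optimizer = amp.initialize(model, optimizer, opt_level='O3', keep_batchnorm_fp32=True)

# Configuration 3: runs, but losses go to nan (expected for pure fp16)
model, optimizer = amp.initialize(model, optimizer, opt_level='O3', keep_batchnorm_fp32=False)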
Information about system:
python --version = 3.7.3
nvcc --version release 10.0, V10.0.130
torch.__version__ = '1.1.0a0+95ce796'
Amp downloaded and installed today, May 13: commit = 4ff153cd50e4533b21dc1fd97c0ed609e19c4042
Thanks for your help!
@jbohnslav @mcarilli I could reproduce this error by disabling cuDNN (using torch.backends.cudnn.enabled = False). @jbohnslav Could this be the issue? Are you disabling cuDNN (accidentally)?
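As a quick sanity check, this sketch (reusing the model and images from the minimal example above) prints the flag before the forward pass; with cuDNN disabled, the non-cuDNN batch norm kernel appears to reject fp16 inputs combined with fp32 weights, which matches the reported error:

import torch

print(torch.backends.cudnn.enabled)    # True by default

torch.backends.cudnn.enabled = False   # disabling cuDNN ...
outputs = model(images)                # ... reproduces the RuntimeError under O1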
@mcarilli,
Sure, I ran a few quick tests. I have a rather complex loss function with many components, so I’m listing training speed and inference speed separately. Training covers the forward pass, loss computation, backward pass, and optimizer step; inference is the forward pass only. Note that with opt_level='O3', keep_batchnorm_fp32=True, losses became nan.
I’m using a Titan RTX GPU. It seems like most of the speedup was just fixing my cuDNN setup. There’s a ~20% increase in speed with half precision during inference, but a ~50% slowdown during training. This isn’t the ultimate speedup I’d hoped for with Tensor Cores. That’s outside the scope of this issue, however.
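For anyone repeating these measurements, here is a rough sketch of how per-step timings can be taken on GPU. The model, criterion, optimizer, and images handles are the hypothetical ones from the example above, and torch.cuda.synchronize() is needed so the timer does not stop before the queued CUDA kernels finish:

import time
import torch
from apex import amp

def avg_step_ms(train=True, iters=50):
    model(images)                       # warm-up so lazy initialization is not timed
    torch.cuda.synchronize()            # wait for queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        if train:
            outputs = model(images)
            loss = criterion(images, outputs)
            optimizer.zero_grad()
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            optimizer.step()
        else:
            with torch.no_grad():       # inference: forward pass only
                model(images)
    torch.cuda.synchronize()
    return 1000.0 * (time.perf_counter() - start) / iters

print("training step:  %.1f ms" % avg_step_ms(train=True))
print("inference step: %.1f ms" % avg_step_ms(train=False))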