torch.autograd.grad()
I’m using a gradient penalty, something like the following:
y = model(x)
loss = some_loss_func(y)
gradients = torch.autograd.grad(
    outputs=y,
    inputs=x,
    grad_outputs=y.new_ones(y.size()),
    create_graph=True,
    retain_graph=True,
    only_inputs=True)[0]
gradients = gradients.view(gradients.size(0), -1)
penalty = (gradients.norm(2, dim=1) ** 2).mean()
loss += penalty
with amp.scale_loss(loss, self.optimizer) as scaled_loss:
    scaled_loss.backward()
This results in the error "RuntimeError: expected type torch.cuda.FloatTensor but got torch.cuda.HalfTensor" on the .backward() call, in either O1 or O2 mode (but not O0 or O3). When I remove the gradient penalty, the code runs fine in all modes. I’m running on a single GPU.
Is this expected? If so, is there a suggested alternative?
Issue Analytics
- Created 5 years ago
- Comments: 8 (2 by maintainers)
I think I know what’s happening. This is a nice one… With both O1 and O2, batchnorm weights are kept in FP32, which is a requirement for using cuDNN batchnorm. In O1, batchnorm weights remain FP32 because all weights remain FP32. In O2, batchnorm weights remain FP32 because we explicitly special-case them, while the rest of the model weights are cast to FP16. cuDNN batchnorm forward can handle FP16 inputs + FP32 weights without trouble, and cuDNN batchnorm backward can handle FP16 incoming gradients + FP32 weights without trouble. However, when a backward pass with create_graph=True is underway, PyTorch falls back to a non-cuDNN (native) implementation of batchnorm backward that is double-differentiable. This native backward implementation CANNOT handle a combination of FP16 incoming gradients + FP32 weights, which (I suspect) causes your error.
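To make the double-backward path concrete, here is a minimal sketch of the same gradient penalty on CPU in plain FP32, with no Apex involved. The toy model, tensor shapes, and variable names are hypothetical stand-ins for the real network. It only shows that create_graph=True makes the first gradient itself differentiable, which is the situation in which batchnorm backward has to be double-differentiable; it does not reproduce the FP16/FP32 mismatch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model containing a batchnorm layer (stand-in for the real network).
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.BatchNorm1d(8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

x = torch.randn(16, 4, requires_grad=True)
y = model(x)

# First-order gradient of y w.r.t. x. create_graph=True records the backward
# computation itself, so the result can be differentiated a second time.
(gradients,) = torch.autograd.grad(
    outputs=y,
    inputs=x,
    grad_outputs=torch.ones_like(y),
    create_graph=True,
)

# Same penalty as in the question: mean squared L2 norm of the per-sample gradient.
penalty = gradients.view(gradients.size(0), -1).norm(2, dim=1).pow(2).mean()

# Backpropagating the penalty runs a backward pass *through* the first backward
# pass, including the backward of the batchnorm layer.
penalty.backward()

has_double_backward_graph = gradients.grad_fn is not None
first_layer_weight_grad = model[0].weight.grad
```

In FP32 this runs without a dtype error, which is consistent with the report that O0 works while O1 and O2 fail.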
There are a couple of approaches that might help here. With O1, you can try registering batchnorm as a blacklist (FP32) function, which will ensure its inputs and outputs (and therefore its incoming gradients during backward) are cast to FP32.
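A minimal sketch of what that O1 registration might look like, assuming Apex's amp.register_float_function is the intended mechanism and that torch.batch_norm is the function to register. The helper name setup_amp_o1_with_fp32_batchnorm is hypothetical, and Apex is imported lazily inside it because it is a separate install. As far as I know, the registration has to happen before amp.initialize is called.

```python
def setup_amp_o1_with_fp32_batchnorm(model, optimizer):
    """Hypothetical helper: force batchnorm to run in FP32 under O1.

    Assumes NVIDIA Apex is installed and that registering torch.batch_norm
    as a float function is the registration meant above.
    """
    import torch
    from apex import amp  # separate install; not part of PyTorch

    # Ask amp to cast batch_norm inputs to FP32 (the "blacklist" behaviour).
    # This is done before amp.initialize so the patching picks it up.
    amp.register_float_function(torch, "batch_norm")

    # O1 patches torch functions to cast per-op; model weights stay FP32.
    return amp.initialize(model, optimizer, opt_level="O1")
```

It would be called once during setup, e.g. model, optimizer = setup_amp_o1_with_fp32_batchnorm(model, optimizer), after which the training step from the question stays the same.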
Alternatively, with O2, you can work around by supplying the override keep_batchnorm_fp32=False, but this is less safe numerically imo.
@mcarilli Thanks for your analysis! It’s OK for my work.