torch.autograd.grad()

I’m using a gradient penalty, something like the following:

y = model(x)
loss = some_loss_func(y)

# Gradient of y with respect to x; create_graph=True keeps this computation
# in the graph so the penalty term below is itself differentiable.
gradients = torch.autograd.grad(
  outputs=y,
  inputs=x,
  grad_outputs=y.new_ones(y.size()),
  create_graph=True,
  retain_graph=True,
  only_inputs=True)[0]
gradients = gradients.view(gradients.size(0), -1)  # flatten per sample
penalty = (gradients.norm(2, dim=1) ** 2).mean()   # mean squared L2 norm

loss += penalty
with amp.scale_loss(loss, self.optimizer) as scaled_loss:
  scaled_loss.backward()

This results in the error RuntimeError: expected type torch.cuda.FloatTensor but got torch.cuda.HalfTensor on the .backward() call, in either O1 or O2 mode (but not O0 or O3). When I remove the gradient penalty, the code runs fine in all modes. I’m running on a single GPU.

Is this expected? If so, is there a suggested alternative?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 8 (2 by maintainers)

Top GitHub Comments

4 reactions
mcarilli commented, Mar 22, 2019

I think I know what’s happening. This is a nice one…

With both O1 and O2, batchnorm weights are kept in FP32, which is a requirement to enable cudnn batchnorm. In O1 batchnorm weights remain FP32 because all weights remain FP32. In O2 batchnorm weights remain FP32 because we explicitly special-case keeping batchnorm weights in FP32, while the rest of the model weights are cast to FP16. Cudnn batchnorm forward can handle FP16 inputs + FP32 weights without trouble, and cudnn batchnorm backward can handle FP16 incoming gradients + FP32 weights without trouble.

However, when a backward pass with create_graph=True is underway, PyTorch falls back to a non-cudnn (native) implementation of batchnorm backward that is double-differentiable. This native backward implementation CANNOT handle a combination of FP16 incoming gradients + FP32 weights, which (I suspect) causes your error.
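
For illustration, here is a minimal sketch of the dtype combination described above. It uses a hypothetical toy model (not the original poster's code), requires a CUDA device, and mimics O2 by casting the conv weights to FP16 while keeping the batchnorm in FP32, so the batchnorm sees FP16 activations (and, during backward, FP16 incoming gradients) alongside its FP32 parameters:

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, 3, padding=1).cuda().half()  # model weights in FP16, as O2 would cast them
bn = nn.BatchNorm2d(8).cuda().float()                # batchnorm explicitly kept in FP32

x = torch.randn(4, 3, 16, 16, device="cuda", dtype=torch.float16)
y = bn(conv(x))

print(conv.weight.dtype)  # torch.float16
print(bn.weight.dtype)    # torch.float32
print(y.dtype)            # torch.float16, so FP16 gradients flow back into the FP32 batchnorm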

There are a couple of approaches that might help here. With O1, you can try registering batchnorm as a blacklist function, which will ensure its inputs and outputs (and therefore its incoming gradients during backward) are cast to FP32:

amp.register_float_function(torch, 'batch_norm')
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

Alternatively, with O2, you can work around it by supplying the override keep_batchnorm_fp32=False, but this is less safe numerically in my opinion.
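
For reference, a minimal sketch of that O2 override (assuming model and optimizer have already been constructed, as in the snippet above):

from apex import amp

# With keep_batchnorm_fp32=False, batchnorm weights are cast to FP16 along with
# the rest of the model, so the batchnorm backward sees a single dtype.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2",
                                  keep_batchnorm_fp32=False)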

1 reaction
sunshineInmoon commented, Mar 26, 2019

@mcarilli Thanks for your analysis! It works for my case.

