
Simple training script with loss causes an exception

See original GitHub issue

The following steps cause a runtime exception NotImplementedError. I am sorry in advance if I am making a mistake. Here is a repro. Using "eager" or "aot_eager" instead of "inductor" does not work either.

% git log | head -1
commit 8c9f11ca6f2789f06785e7606bdb99f087bcc73a
% pip list | grep torch
torch              1.13.0.dev20220927+cpu
torch-mlir         20220927.609
torchdynamo        1.13.0.dev0            /mnt/xvdc/DeepTools/DD2/torchdynamo
torchvision        0.14.0.dev20220927+cpu
% cat torchdynamo_loss.py
import torch
import torchdynamo

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y):
        x = torch.bmm(x, y)
        x = torch.flatten(x, 1)
        return x

@torchdynamo.optimize("inductor")
def training_iter_fn(batch, model, optimizer):
    optimizer.zero_grad()
    out = model(**batch)
    lossFn = torch.nn.CrossEntropyLoss()
    target = torch.tensor([0, 7])
    loss = lossFn(out, target)
    loss.backward()
    optimizer.step()
    return loss

net = Net()
input1 = torch.randn(2, 1, 4)
input2 = torch.randn(2, 4, 8, requires_grad=True)
optimizer = torch.optim.Adam([input2], lr=0.1)

opt_training_iter_fn = training_iter_fn
batch = {"x":input1, "y":input2}
loss = opt_training_iter_fn(batch, net, optimizer)

print(input2.cpu())


% python torchdynamo_loss.py 
[2022-10-12 06:42:13,178] torchdynamo.variables.torch: [WARNING] Profiler will be ignored
[2022-10-12 06:42:13,187] torchdynamo.symbolic_convert: [WARNING] Graph break: Tensor.backward from user code at   File "/home/ishizaki/torchdynamo/tmp/torchdynamo_loss.py", line 26, in training_iter_fn
    loss.backward()

[2022-10-12 06:42:14,922] torchdynamo.symbolic_convert: [WARNING] Graph break: inline with __closure__ from user code at   File "/home/ishizaki/torchdynamo/tmp/torchdynamo_loss.py", line 27, in <graph break in training_iter_fn>
    optimizer.step()

[2022-10-12 06:42:14,934] torchdynamo.symbolic_convert: [WARNING] Graph break: inline in skipfiles: _fn /home/ishizaki/torchdynamo/torchdynamo/eval_frame.py from user code at   File "/home/ishizaki/torchdynamo/.venv/lib/python3.9/site-packages/torch/optim/adam.py", line 178, in step
    self._cuda_graph_capture_health_check()

[2022-10-12 06:42:14,948] torchdynamo.convert_frame: [ERROR] WON'T CONVERT <graph break in step> /home/ishizaki/torchdynamo/.venv/lib/python3.9/site-packages/torch/optim/adam.py line 178 
due to: 
Traceback (most recent call last):
  File "/home/ishizaki/torchdynamo/torchdynamo/variables/base.py", line 146, in as_python_constant
    raise NotImplementedError(f"{self} is not a constant")
NotImplementedError: TensorVariable() is not a constant

from user code:
   File "/home/ishizaki/torchdynamo/.venv/lib/python3.9/site-packages/torch/optim/adam.py", line 209, in <graph break in step>
    state = self.state[p]

Set torchdynamo.config.verbose=True for more information
==========
[2022-10-12 06:42:14,961] torchdynamo.symbolic_convert: [WARNING] Graph break: Tensor.item from user code at   File "/home/ishizaki/torchdynamo/.venv/lib/python3.9/site-packages/torch/optim/adam.py", line 300, in adam
    func(params,
  File "/home/ishizaki/torchdynamo/.venv/lib/python3.9/site-packages/torch/optim/adam.py", line 395, in _single_tensor_adam
    step = step_t.item()

[2022-10-12 06:42:14,980] torchdynamo.optimizations.training: [WARNING] Unable to use Aot Autograd because of presence of mutation
[2022-10-12 06:42:14,980] torchinductor.compile_fx: [WARNING] Aot Autograd is not safe to run, so falling back to eager
[2022-10-12 06:42:15,000] torchdynamo.optimizations.training: [WARNING] Unable to use Aot Autograd because of presence of mutation
[2022-10-12 06:42:15,000] torchinductor.compile_fx: [WARNING] Aot Autograd is not safe to run, so falling back to eager
tensor([[[ 1.2348, -0.7126,  1.2387,  2.0989, -0.1772,  0.4236, -0.0968,
          -0.3471],
         [ 0.6295, -1.5259,  0.6826,  1.0028, -0.4873,  0.0893, -0.2904,
           0.1033],
         [ 1.7367,  0.7345,  1.5238, -1.9054, -1.9447,  0.4717,  0.1325,
          -0.6108],
         [ 0.5933,  0.7319,  1.5816, -0.3573,  0.3974, -1.0648, -2.0550,
           0.6247]],

        [[ 0.4393,  0.4159, -0.4996,  0.3288, -0.9796, -0.0822, -0.6735,
           0.4048],
         [-1.1754, -0.2157,  1.0433, -0.3781,  0.5304, -2.7421, -1.1731,
          -0.6624],
         [ 0.3439, -0.4731,  0.4820, -0.1286, -0.1511,  0.4843,  1.1936,
           1.2146],
         [ 1.9118, -1.4318, -0.6035,  0.0142,  0.8406,  1.2690, -0.2417,
           0.4326]]], requires_grad=True)

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
yanboliang commented, Oct 13, 2022

I wanted to provide context that this pattern of optimizing a Tensor with requires_grad=True (instead of an nn.Parameter) is valid. This is a pattern of doing gradient descent on the inputs (instead of the weights), and it is used in a variety of neural network applications, including adversarial examples, reconstruction models, and a variety of interpretability research, including PyTorch's own https://captum.ai/ package.

Thanks for providing the context, it makes sense! @soumith Currently we use a defaultdict with Tensors as keys across all optimizers (including Adam in this case). We have to specialize the Tensor's value during compilation if it is used as a key. Right now we only do that specialization if the Tensor is an nn.Parameter, which works well for the majority of cases. If we want to support optimizing over any Tensor with requires_grad=True, we just need to relax this restriction. The downside is that it may take more memory if we specialize these non-parameter tensors; as you mentioned, this case is a bit rarer, so it is a trade-off. Anyway, I'll send a PR soon.
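
A rough sketch of the state lookup described above (simplified for illustration; this is not the actual torch.optim source) shows why the Tensor-as-key pattern forces specialization:

import torch
from collections import defaultdict

# Simplified illustration of the per-parameter state lookup that shows up in
# the traceback as `state = self.state[p]`. The optimizer keeps a defaultdict
# whose keys are the parameter tensors themselves, so tracing this lookup
# requires dynamo to specialize on the concrete tensor used as the key.
state = defaultdict(dict)

p = torch.randn(2, 4, 8, requires_grad=True)  # plain tensor, not an nn.Parameter

entry = state[p]          # Tensor used as a dict key
if len(entry) == 0:       # lazily initialize Adam-style state for this tensor
    entry["step"] = torch.tensor(0.0)
    entry["exp_avg"] = torch.zeros_like(p)
    entry["exp_avg_sq"] = torch.zeros_like(p)

Until that restriction is relaxed, one possible workaround (untested here) is to register the optimized tensor as an nn.Parameter, e.g. input2 = torch.nn.Parameter(torch.randn(2, 4, 8)), so that it hits the existing specialization path.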

1 reaction
kiszk commented, Oct 13, 2022

Yeah, this code works well without torchdynamo. Let me double-check the Adam API.
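
For reference, here is a minimal eager-mode version of the same training step (assuming the same Net, inputs, and Adam setup as in the repro above); it runs cleanly once the @torchdynamo.optimize decorator is dropped:

import torch

class Net(torch.nn.Module):
    def forward(self, x, y):
        return torch.flatten(torch.bmm(x, y), 1)

def training_iter_fn(batch, model, optimizer):
    # same training step as in the repro, minus the @torchdynamo.optimize decorator
    optimizer.zero_grad()
    out = model(**batch)
    loss = torch.nn.CrossEntropyLoss()(out, torch.tensor([0, 7]))
    loss.backward()
    optimizer.step()
    return loss

net = Net()
input1 = torch.randn(2, 1, 4)
input2 = torch.randn(2, 4, 8, requires_grad=True)  # optimizing the input, not a weight
optimizer = torch.optim.Adam([input2], lr=0.1)

print(training_iter_fn({"x": input1, "y": input2}, net, optimizer).item())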


