examples/imagenet failing

This is the diff of me taking examples/imagenet and modifying it to include dynamo:

diff --git a/imagenet/main.py b/imagenet/main.py
index e828ea0..3e47c92 100644
--- a/imagenet/main.py
+++ b/imagenet/main.py
@@ -7,6 +7,7 @@ import warnings
 from enum import Enum

 import torch
+import torch._dynamo as dynamo
 import torch.backends.cudnn as cudnn
 import torch.distributed as dist
 import torch.multiprocessing as mp
@@ -266,6 +267,8 @@ def main_worker(gpu, ngpus_per_node, args):
         val_dataset, batch_size=args.batch_size, shuffle=False,
         num_workers=args.workers, pin_memory=True, sampler=val_sampler)

+    model = dynamo.optimize()(model)
+
     if args.evaluate:
         validate(val_loader, model, criterion, args)
         return

This is the failure:

$ python main.py /home/soumith/dataset/imagenet
/home/soumith/code/vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
  warn(f"Failed to load image Python extension: {e}")
=> creating model 'resnet18'
WARNING:torch._inductor.lowering:make_fallback(aten.unfold): a decomposition exists, we should switch to it
WARNING:torch._inductor.lowering:make_fallback(aten.unfold_backward): a decomposition exists, we should switch to it
Epoch: [0][   1/5005]   Time 11.746 (11.746)    Data  2.157 ( 2.157)    Loss 7.0818e+00 (7.0818e+00)    Acc@1   0.00 (  0.00)   Acc@5   0.00 (  0.00)
Traceback (most recent call last):
  File "/home/soumith/code/examples/imagenet/main.py", line 514, in <module>
    main()
  File "/home/soumith/code/examples/imagenet/main.py", line 121, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "/home/soumith/code/examples/imagenet/main.py", line 281, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, device, args)
  File "/home/soumith/code/examples/imagenet/main.py", line 328, in train
    output = model(images)
  File "/home/soumith/code/pytorch/torch/_dynamo/eval_frame.py", line 137, in __call__
    return self.forward(*args, **kwargs)
  File "/home/soumith/code/pytorch/torch/_dynamo/eval_frame.py", line 134, in forward
    return optimized_forward(*args, **kwargs)
  File "/home/soumith/code/pytorch/torch/_dynamo/eval_frame.py", line 157, in _fn
    return fn(*args, **kwargs)
  File "/home/soumith/code/pytorch/torch/nn/parallel/data_parallel.py", line 169, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/soumith/code/pytorch/torch/nn/modules/module.py", line 1357, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/soumith/code/vision/torchvision/models/resnet.py", line 284, in forward
    def forward(self, x: Tensor) -> Tensor:
  File "/home/soumith/code/pytorch/torch/_dynamo/eval_frame.py", line 157, in _fn
    return fn(*args, **kwargs)
  File "/home/soumith/code/pytorch/functorch/_src/aot_autograd.py", line 856, in forward
    return compiled_f(
  File "/home/soumith/code/pytorch/functorch/_src/aot_autograd.py", line 847, in new_func
    return compiled_fn(args)
  File "/home/soumith/code/pytorch/functorch/_src/aot_autograd.py", line 230, in g
    return f(*args)
  File "/home/soumith/code/pytorch/functorch/_src/aot_autograd.py", line 475, in compiled_function
    return CompiledFunction.apply(*remove_dupe_args(args))
RuntimeError: A view was created in no_grad mode and its base or another view of its base has been modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked).

What works?

  • dynamo.optimize('eager')(model) works fine and runs without error.
  • dynamo.optimize('aot_eager')(model) also works fine and runs without error.

Minifying

Here’s my attempt to minify; nothing minified really came out of it:

Attempt 1:

$ TORCHDYNAMO_REPRO_AFTER="dynamo" python main.py /home/soumith/dataset/imagenet
[exactly same output as above]

Attempt 2:

$ TORCHDYNAMO_REPRO_AFTER="aot" python main.py /home/soumith/dataset/imagenet
/home/soumith/code/vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
  warn(f"Failed to load image Python extension: {e}")
=> creating model 'resnet18'
WARNING:torch._inductor.lowering:make_fallback(aten.unfold): a decomposition exists, we should switch to it
WARNING:torch._inductor.lowering:make_fallback(aten.unfold_backward): a decomposition exists, we should switch to it
Epoch: [0][   1/5005]   Time 11.820 (11.820)    Data  1.918 ( 1.918)    Loss 6.9964e+00 (6.9964e+00)    Acc@1   0.00 (  0.00)   Acc@5   0.00 (  0.00)
WARNING:torch._dynamo.debug_utils:Writing minified repro to /tmp/minifier_soumith/minifier_launcher.py
WARNING:torch._dynamo.debug_utils:Copying minified repro from /tmp/minifier_soumith/minifier_launcher.py to /home/soumith/code/pytorch/minifier_launcher.py for convenience
Traceback (most recent call last):
  File "/home/soumith/code/pytorch/torch/fx/interpreter.py", line 199, in placeholder
    return next(self.args_iter)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/soumith/code/examples/imagenet/main.py", line 514, in <module>
    main()
  File "/home/soumith/code/examples/imagenet/main.py", line 121, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "/home/soumith/code/examples/imagenet/main.py", line 281, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, device, args)
  File "/home/soumith/code/examples/imagenet/main.py", line 339, in train
    loss.backward()
  File "/home/soumith/code/pytorch/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/soumith/code/pytorch/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/soumith/code/pytorch/torch/autograd/function.py", line 270, in apply
    return user_fn(self, *args)
  File "/home/soumith/code/pytorch/functorch/_src/aot_autograd.py", line 468, in backward
    out = call_func_with_args(
  File "/home/soumith/code/pytorch/functorch/_src/aot_autograd.py", line 255, in call_func_with_args
    out = normalize_as_list(f(args))
  File "/home/soumith/code/pytorch/torch/_dynamo/eval_frame.py", line 157, in _fn
    return fn(*args, **kwargs)
  File "/home/soumith/code/pytorch/torch/_dynamo/debug_utils.py", line 444, in deferred_for_real_inputs
    raise e
  File "/home/soumith/code/pytorch/torch/_dynamo/debug_utils.py", line 428, in deferred_for_real_inputs
    compiled_fn = compiler_fn(gm, example_inputs, **kwargs)
  File "/home/soumith/code/pytorch/torch/_inductor/debug.py", line 178, in inner
    return fn(*args, **kwargs)
  File "/home/soumith/miniconda3/envs/pytorch/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/soumith/code/pytorch/torch/_inductor/compile_fx.py", line 106, in compile_fx_inner
    graph.run(*example_inputs)
  File "/home/soumith/code/pytorch/torch/_dynamo/utils.py", line 85, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/soumith/code/pytorch/torch/_inductor/graph.py", line 143, in run
    return super().run(*args)
  File "/home/soumith/code/pytorch/torch/fx/interpreter.py", line 130, in run
    self.env[node] = self.run_node(node)
  File "/home/soumith/code/pytorch/torch/_inductor/graph.py", line 311, in run_node
    result = super().run_node(n)
  File "/home/soumith/code/pytorch/torch/fx/interpreter.py", line 171, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "/home/soumith/code/pytorch/torch/_inductor/graph.py", line 208, in placeholder
    example: torch.Tensor = super().placeholder(target, args, kwargs)
  File "/home/soumith/code/pytorch/torch/fx/interpreter.py", line 204, in placeholder
    raise RuntimeError(f'Expected positional argument for parameter {target}, but one was not passed in!')
RuntimeError: Expected positional argument for parameter primals_1, but one was not passed in!

While executing %primals_1 : [#users=1] = placeholder[target=primals_1]
Original traceback:
None

System Info

Latest PyTorch built from source, pinned Triton, CUDA 11.6, single RTX 3090.

Collecting environment information...
PyTorch version: 1.14.0a0+git054a2fd
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.31

Python version: 3.9.13 (main, Oct 13 2022, 21:15:33)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-50-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 515.65.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] torch==1.14.0a0+git054a2fd
[pip3] torchvision==0.15.0a0+f467349
[conda] blas                      1.0                         mkl

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

1 reaction
yanboliang commented, Oct 16, 2022

I encountered a similar issue before. I tested this minimized example and found that it has the same symptom: it errors only on GPU, while CPU works well.

0 reactions
ngimel commented, Oct 17, 2022

As @anijain2305 said, this looks like a bug in cudagraphify in inductor (one that won’t affect any of the TORCHINDUCTOR_TRACE=1 dumps), since disabling cudagraphs also fixes this minimal example.
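To check whether CUDA graphs are the culprit in a given setup, one option is to turn them off before compiling the model. The sketch below assumes the flag is exposed as `torch._inductor.config.triton.cudagraphs`, which is where it lived in builds from around this time; the wrapper `disable_inductor_cudagraphs` is a hypothetical helper, and the attribute name may differ in other versions.

```python
def disable_inductor_cudagraphs() -> bool:
    """Turn off CUDA graph capture in inductor, if inductor is importable.

    Returns True when the flag was set, False when torch._inductor could not
    be imported (for example, on a machine without PyTorch installed).
    """
    try:
        import torch._inductor.config as inductor_config
    except ImportError:
        return False
    # Assumption: the flag name below matches the installed PyTorch build.
    inductor_config.triton.cudagraphs = False
    return True


if __name__ == "__main__":
    # Call this before wrapping the model, e.g. before
    #   model = dynamo.optimize()(model)
    print("cudagraphs disabled:", disable_inductor_cudagraphs())
```

If the training loop runs cleanly with the flag off, that is consistent with the diagnosis above, although it does not by itself rule out other causes.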
