examples/imagenet failing
This is the diff of me taking examples/imagenet and modifying it to include dynamo:
diff --git a/imagenet/main.py b/imagenet/main.py
index e828ea0..3e47c92 100644
--- a/imagenet/main.py
+++ b/imagenet/main.py
@@ -7,6 +7,7 @@ import warnings
 from enum import Enum
 
 import torch
+import torch._dynamo as dynamo
 import torch.backends.cudnn as cudnn
 import torch.distributed as dist
 import torch.multiprocessing as mp
@@ -266,6 +267,8 @@ def main_worker(gpu, ngpus_per_node, args):
         val_dataset, batch_size=args.batch_size, shuffle=False,
         num_workers=args.workers, pin_memory=True, sampler=val_sampler)
 
+    model = dynamo.optimize()(model)
+
     if args.evaluate:
         validate(val_loader, model, criterion, args)
         return
This is the failure:
$ python main.py /home/soumith/dataset/imagenet
/home/soumith/code/vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
=> creating model 'resnet18'
WARNING:torch._inductor.lowering:make_fallback(aten.unfold): a decomposition exists, we should switch to it
WARNING:torch._inductor.lowering:make_fallback(aten.unfold_backward): a decomposition exists, we should switch to it
Epoch: [0][ 1/5005] Time 11.746 (11.746) Data 2.157 ( 2.157) Loss 7.0818e+00 (7.0818e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)
Traceback (most recent call last):
File "/home/soumith/code/examples/imagenet/main.py", line 514, in <module>
main()
File "/home/soumith/code/examples/imagenet/main.py", line 121, in main
main_worker(args.gpu, ngpus_per_node, args)
File "/home/soumith/code/examples/imagenet/main.py", line 281, in main_worker
train(train_loader, model, criterion, optimizer, epoch, device, args)
File "/home/soumith/code/examples/imagenet/main.py", line 328, in train
output = model(images)
File "/home/soumith/code/pytorch/torch/_dynamo/eval_frame.py", line 137, in __call__
return self.forward(*args, **kwargs)
File "/home/soumith/code/pytorch/torch/_dynamo/eval_frame.py", line 134, in forward
return optimized_forward(*args, **kwargs)
File "/home/soumith/code/pytorch/torch/_dynamo/eval_frame.py", line 157, in _fn
return fn(*args, **kwargs)
File "/home/soumith/code/pytorch/torch/nn/parallel/data_parallel.py", line 169, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/soumith/code/pytorch/torch/nn/modules/module.py", line 1357, in _call_impl
return forward_call(*input, **kwargs)
File "/home/soumith/code/vision/torchvision/models/resnet.py", line 284, in forward
def forward(self, x: Tensor) -> Tensor:
File "/home/soumith/code/pytorch/torch/_dynamo/eval_frame.py", line 157, in _fn
return fn(*args, **kwargs)
File "/home/soumith/code/pytorch/functorch/_src/aot_autograd.py", line 856, in forward
return compiled_f(
File "/home/soumith/code/pytorch/functorch/_src/aot_autograd.py", line 847, in new_func
return compiled_fn(args)
File "/home/soumith/code/pytorch/functorch/_src/aot_autograd.py", line 230, in g
return f(*args)
File "/home/soumith/code/pytorch/functorch/_src/aot_autograd.py", line 475, in compiled_function
return CompiledFunction.apply(*remove_dupe_args(args))
RuntimeError: A view was created in no_grad mode and its base or another view of its base has been modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked).
What works?
- `dynamo.optimize('eager')(model)` works fine and runs without error.
- `dynamo.optimize('aot_eager')(model)` also works fine and runs without error.
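For anyone repeating this comparison, here is a rough sketch of a harness that tries each backend in turn and records which ones fail. `try_backends` and `fake_step` are hypothetical names, not part of the report or of PyTorch. `fake_step` only simulates the outcome described above (the default backend, which the log suggests is inductor, fails while `eager` and `aot_eager` pass); in the real script it would wrap the model with `dynamo.optimize(name)(model)` and run one training iteration.

```python
from typing import Callable, Dict, Iterable


def try_backends(run_step: Callable[[str], None],
                 backends: Iterable[str]) -> Dict[str, str]:
    """Call run_step once per backend name and record 'ok' or the error text."""
    results = {}
    for name in backends:
        try:
            run_step(name)
            results[name] = "ok"
        except Exception as exc:  # keep going so every backend gets tried
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


def fake_step(name: str) -> None:
    """Stand-in for one training step; simulates the failure pattern reported here."""
    # Assumption: in main.py this would be dynamo.optimize(name)(model)
    # followed by a forward and backward pass.
    if name == "inductor":
        raise RuntimeError("simulated failure")


report = try_backends(fake_step, ["eager", "aot_eager", "inductor"])
for name, status in report.items():
    print(f"{name:10s} {status}")
```

Catching the exception per backend keeps the loop going, so a single run shows the whole pass/fail picture rather than stopping at the first failing backend.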
Minifying
Here are my attempts to minify; I didn't really see anything minified come out:
Attempt 1:
$ TORCHDYNAMO_REPRO_AFTER="dynamo" python main.py /home/soumith/dataset/imagenet
[exactly same output as above]
Attempt 2:
$ TORCHDYNAMO_REPRO_AFTER="aot" python main.py /home/soumith/dataset/imagenet
/home/soumith/code/vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
=> creating model 'resnet18'
WARNING:torch._inductor.lowering:make_fallback(aten.unfold): a decomposition exists, we should switch to it
WARNING:torch._inductor.lowering:make_fallback(aten.unfold_backward): a decomposition exists, we should switch to it
Epoch: [0][ 1/5005] Time 11.820 (11.820) Data 1.918 ( 1.918) Loss 6.9964e+00 (6.9964e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)
WARNING:torch._dynamo.debug_utils:Writing minified repro to /tmp/minifier_soumith/minifier_launcher.py
WARNING:torch._dynamo.debug_utils:Copying minified repro from /tmp/minifier_soumith/minifier_launcher.py to /home/soumith/code/pytorch/minifier_launcher.py for convenience
Traceback (most recent call last):
File "/home/soumith/code/pytorch/torch/fx/interpreter.py", line 199, in placeholder
return next(self.args_iter)
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/soumith/code/examples/imagenet/main.py", line 514, in <module>
main()
File "/home/soumith/code/examples/imagenet/main.py", line 121, in main
main_worker(args.gpu, ngpus_per_node, args)
File "/home/soumith/code/examples/imagenet/main.py", line 281, in main_worker
train(train_loader, model, criterion, optimizer, epoch, device, args)
File "/home/soumith/code/examples/imagenet/main.py", line 339, in train
loss.backward()
File "/home/soumith/code/pytorch/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/home/soumith/code/pytorch/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/soumith/code/pytorch/torch/autograd/function.py", line 270, in apply
return user_fn(self, *args)
File "/home/soumith/code/pytorch/functorch/_src/aot_autograd.py", line 468, in backward
out = call_func_with_args(
File "/home/soumith/code/pytorch/functorch/_src/aot_autograd.py", line 255, in call_func_with_args
out = normalize_as_list(f(args))
File "/home/soumith/code/pytorch/torch/_dynamo/eval_frame.py", line 157, in _fn
return fn(*args, **kwargs)
File "/home/soumith/code/pytorch/torch/_dynamo/debug_utils.py", line 444, in deferred_for_real_inputs
raise e
File "/home/soumith/code/pytorch/torch/_dynamo/debug_utils.py", line 428, in deferred_for_real_inputs
compiled_fn = compiler_fn(gm, example_inputs, **kwargs)
File "/home/soumith/code/pytorch/torch/_inductor/debug.py", line 178, in inner
return fn(*args, **kwargs)
File "/home/soumith/miniconda3/envs/pytorch/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/home/soumith/code/pytorch/torch/_inductor/compile_fx.py", line 106, in compile_fx_inner
graph.run(*example_inputs)
File "/home/soumith/code/pytorch/torch/_dynamo/utils.py", line 85, in time_wrapper
r = func(*args, **kwargs)
File "/home/soumith/code/pytorch/torch/_inductor/graph.py", line 143, in run
return super().run(*args)
File "/home/soumith/code/pytorch/torch/fx/interpreter.py", line 130, in run
self.env[node] = self.run_node(node)
File "/home/soumith/code/pytorch/torch/_inductor/graph.py", line 311, in run_node
result = super().run_node(n)
File "/home/soumith/code/pytorch/torch/fx/interpreter.py", line 171, in run_node
return getattr(self, n.op)(n.target, args, kwargs)
File "/home/soumith/code/pytorch/torch/_inductor/graph.py", line 208, in placeholder
example: torch.Tensor = super().placeholder(target, args, kwargs)
File "/home/soumith/code/pytorch/torch/fx/interpreter.py", line 204, in placeholder
raise RuntimeError(f'Expected positional argument for parameter {target}, but one was not passed in!')
RuntimeError: Expected positional argument for parameter primals_1, but one was not passed in!
While executing %primals_1 : [#users=1] = placeholder[target=primals_1]
Original traceback:
None
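The two minifier attempts above differ only in the value of `TORCHDYNAMO_REPRO_AFTER`, so they can be driven from one small script. This is a sketch under assumptions: `minifier_command` is a hypothetical helper, `/path/to/imagenet` is a placeholder dataset path, and `VALID_STAGES` lists only the two values tried in this report. The launch itself is left commented out.

```python
import os
import shlex

# Only the two values used in this report; other values are not assumed here.
VALID_STAGES = ("dynamo", "aot")


def minifier_command(stage, script="main.py", data_dir="/path/to/imagenet"):
    """Return (env, argv) for one minifier attempt at the given stage."""
    if stage not in VALID_STAGES:
        raise ValueError(f"unexpected stage {stage!r}, expected one of {VALID_STAGES}")
    # Copy the current environment and add the minifier switch on top.
    env = dict(os.environ, TORCHDYNAMO_REPRO_AFTER=stage)
    argv = ["python", script, data_dir]
    return env, argv


for stage in VALID_STAGES:
    env, argv = minifier_command(stage)
    print(f'TORCHDYNAMO_REPRO_AFTER="{stage}" {shlex.join(argv)}')
    # To actually launch the attempt:
    # import subprocess; subprocess.run(argv, env=env, check=False)
```

Passing the variable through `env` rather than editing `os.environ` in place keeps the two attempts from leaking settings into each other.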
System Info
Latest PyTorch built from source, pinned Triton, CUDA 11.6, single RTX 3090.
Collecting environment information...
PyTorch version: 1.14.0a0+git054a2fd
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.31
Python version: 3.9.13 (main, Oct 13 2022, 21:15:33) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-50-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 515.65.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] torch==1.14.0a0+git054a2fd
[pip3] torchvision==0.15.0a0+f467349
[conda] blas 1.0 mkl
Issue Analytics
- Created a year ago
- Comments: 12 (12 by maintainers)
Top GitHub Comments
I had encountered a similar issue before. I tested this minimized example and found it has the same symptom: it errors only on GPU, while CPU works well.
As @anijain2305 said, this looks like a bug in `cudagraphify` in inductor (one that won't affect any of the `TORCHINDUCTOR_TRACE=1` dumps), as disabling cudagraphs also fixes this minimized example.
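The comment does not say how cudagraphs was disabled. One way that may work is flipping the inductor config flag before the model is compiled. This is a sketch that assumes the build exposes `torch._inductor.config.triton.cudagraphs`; `disable_inductor_cudagraphs` is a hypothetical helper name, and it returns `False` instead of failing when that config is not available.

```python
def disable_inductor_cudagraphs() -> bool:
    """Try to turn off cudagraphs in inductor; return True if the flag was set."""
    try:
        # Assumption: this config module and attribute exist in the installed build.
        import torch._inductor.config as inductor_config
        inductor_config.triton.cudagraphs = False
    except (ImportError, AttributeError):
        return False
    return True


disabled = disable_inductor_cudagraphs()
print("cudagraphs disabled:", disabled)
```

Call it before `dynamo.optimize()(model)` so the setting is in place when the first graph is compiled.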