inception_v3 from torchvision 0.3.0 does not work with DataParallel in torch 1.1.0
Environment: Python 3.5, torch 1.1.0, torchvision 0.3.0
Reproducible example:
import torch
import torchvision

# default inception_v3 has aux_logits=True; in training mode it returns a namedtuple
model = torchvision.models.inception_v3().cuda()
model = torch.nn.DataParallel(model, [0, 1])
x = torch.rand((8, 3, 299, 299)).cuda()
model(x)  # fails in DataParallel.gather
Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "env/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.gather(outputs, self.output_device)
  File "/env/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/env/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "env/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: __new__() missing 1 required positional argument: 'aux_logits'
I guess the error occurs because the output of inception_v3 was changed from a plain tuple to a namedtuple: gather_map rebuilds the output with type(out)(map(gather_map, zip(*outputs))), which passes a single iterable instead of the two positional fields the namedtuple constructor expects, so it complains about the missing aux_logits argument.
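One possible workaround, just an untested sketch based on the behaviour described above: wrap the model so that its forward converts the namedtuple into a plain tuple before DataParallel gathers it, since gather_map can rebuild a plain tuple positionally. InceptionTupleWrapper is a name I made up here, not part of torchvision.

import torch
import torchvision

class InceptionTupleWrapper(torch.nn.Module):
    # Converts inception_v3's training-mode namedtuple output into a plain tuple
    # so DataParallel's gather_map can rebuild it without keyword arguments.
    def __init__(self, inception):
        super().__init__()
        self.inception = inception

    def forward(self, x):
        out = self.inception(x)
        if isinstance(out, tuple):  # the namedtuple is a tuple subclass
            return tuple(out)       # drop the namedtuple type, keep (logits, aux_logits)
        return out                  # eval mode returns a single tensor

model = InceptionTupleWrapper(torchvision.models.inception_v3()).cuda()
model = torch.nn.DataParallel(model, [0, 1])
x = torch.rand((8, 3, 299, 299)).cuda()
logits, aux_logits = model(x)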
I tried out your solution, @YongWookHa, but got the error shown below:
train Loss: 0.9664 Acc: 0.5738

Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/torch0721/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-01e31a117c9f>", line 153, in <module>
    num_epochs=25, is_inception=True)
  File "<ipython-input-2-01e31a117c9f>", line 91, in train_model
    outputs, aux_outputs = model(inputs).values()
RuntimeError: Could not run 'aten::values' with arguments from the 'CUDA' backend. 'aten::values' is only available for these backends: [SparseCPU, SparseCUDA, Autograd, Profiler, Tracer].
Could you please give me some suggestions?
Edit: fixed. Since the aux classifiers are not needed for inference, I changed the code so the two outputs are only unpacked in the training phase:

if phase == 'train':
    outputs, aux_outputs = model(inputs).values()
else:
    outputs = model(inputs)

Thanks!
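For context, here is roughly how the inception branch inside train_model can look with this change; criterion, optimizer, dataloaders, phase, and the 0.4 weight on the aux loss are assumptions following the torchvision finetuning tutorial, not the exact code above:

import torch

criterion = torch.nn.CrossEntropyLoss()

for inputs, labels in dataloaders[phase]:
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    with torch.set_grad_enabled(phase == 'train'):
        if phase == 'train':
            # training: the modified model returns both heads
            outputs, aux_outputs = model(inputs).values()
            loss = criterion(outputs, labels) + 0.4 * criterion(aux_outputs, labels)
            loss.backward()
            optimizer.step()
        else:
            # evaluation: inception_v3 returns a single tensor, no aux classifier
            outputs = model(inputs)
            loss = criterion(outputs, labels)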
Yes, I did add .values(), but I was assigning model(inputs).values() to a single output instead of outputs, aux_outputs, so the loss function was computed on a dict instead of a tensor and I got the error. Thanks, your method saved me hours of training time. Earlier I could only train Inception on a single GPU; now, without modifying any PyTorch files, using your code I am able to train on more than one GPU.