
inception_v3 from torchvision 0.3.0 does not work with DataParallel in torch 1.1.0


Environment: Python 3.5, torch 1.1.0, torchvision 0.3.0

Reproducible example:

    import torch
    import torchvision

    model = torchvision.models.inception_v3().cuda()
    model = torch.nn.DataParallel(model, [0, 1])
    x = torch.rand((8, 3, 299, 299)).cuda()
    model.forward(x)

Error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "env/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
        return self.gather(outputs, self.output_device)
      File "/env/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
        return gather(outputs, output_device, dim=self.dim)
      File "/env/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
        return gather_map(outputs)
      File "env/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
        return type(out)(map(gather_map, zip(*outputs)))
    TypeError: __new__() missing 1 required positional argument: 'aux_logits'

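I suspect the error occurs because the output of inception_v3 was changed from a plain tuple to a namedtuple. The gather step rebuilds each output with type(out)(map(gather_map, zip(*outputs))), which works for a tuple but not for a namedtuple, whose constructor expects one positional argument per field.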
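The failing step can be reproduced without GPUs or torch. The sketch below is a simplified, pure-Python imitation of the gather logic shown in the traceback: lists stand in for tensors, and list concatenation stands in for the tensor-gathering step. InceptionOutputs, gather_map_old and gather_map_fixed are hypothetical names for illustration, not PyTorch's or torchvision's actual code. The fixed variant rebuilds namedtuples with _make, which accepts a single iterable.

```python
from collections import namedtuple

# Hypothetical stand-in for torchvision's training-mode output namedtuple
InceptionOutputs = namedtuple("InceptionOutputs", ["logits", "aux_logits"])


def gather_map_old(outputs):
    """Simplified imitation of the torch 1.1.0 gather_map in the traceback.

    `outputs` holds one result per device; lists play the role of tensors.
    """
    out = outputs[0]
    if isinstance(out, list):
        # Stand-in for concatenating per-device tensors along dim 0
        return [x for chunk in outputs for x in chunk]
    # Rebuild the container from ONE iterator argument:
    # fine for tuple(...), but a namedtuple needs one argument per field
    return type(out)(map(gather_map_old, zip(*outputs)))


def gather_map_fixed(outputs):
    """Same logic, except namedtuples are rebuilt with _make(iterable)."""
    out = outputs[0]
    if isinstance(out, list):
        return [x for chunk in outputs for x in chunk]
    gathered = map(gather_map_fixed, zip(*outputs))
    # namedtuple classes are tuple subclasses that expose a _fields attribute
    if isinstance(out, tuple) and hasattr(out, "_fields"):
        return type(out)._make(gathered)
    return type(out)(gathered)
```

Calling gather_map_old on two InceptionOutputs values raises a TypeError about the missing 'aux_logits' argument, matching the traceback, while gather_map_fixed returns a single InceptionOutputs whose fields hold the concatenated per-device lists. Plain tuples work with either version.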

Issue Analytics

Created 4 years ago
Comments: 9 (3 by maintainers)

Top GitHub Comments

sanka4rea commented, Aug 13, 2020 (1 reaction)

I tried out your solution, @YongWookHa, but got the error shown below:

    train Loss: 0.9664 Acc: 0.5738

    Traceback (most recent call last):
      File "/home/xxx/anaconda3/envs/torch0721/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-2-01e31a117c9f>", line 153, in <module>
        num_epochs=25, is_inception=True)
      File "<ipython-input-2-01e31a117c9f>", line 91, in train_model
        outputs, aux_outputs = model(inputs).values()
    RuntimeError: Could not run 'aten::values' with arguments from the 'CUDA' backend. 'aten::values' is only available for these backends: [SparseCPU, SparseCUDA, Autograd, Profiler, Tracer].

Could you please give me some suggestions?

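Edit: fixed. Since the auxiliary classifier is not needed for inference, I changed the code to:

    if phase == 'train':
        # Training mode: the model yields main and auxiliary logits
        outputs, aux_outputs = model(inputs).values()
        loss1 = criterion(outputs, labels)
        loss2 = criterion(aux_outputs, labels)
        # The auxiliary loss is down-weighted by 0.4
        loss = loss1 + 0.4 * loss2
    else:
        # Eval mode: only the main logits come back, so there is nothing to unpack
        outputs = model(inputs)
        loss = criterion(outputs, labels)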

Thanks!
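Why the original line failed in the validation phase: in eval mode, torchvision's inception_v3 returns only the main logits tensor instead of main plus auxiliary outputs, so .values() resolves to Tensor.values(), which is defined only for sparse tensors, hence the 'aten::values' error. The sketch below accepts either form so one code path serves both phases. split_inception_output and compute_loss are hypothetical helpers, and the dict branch assumes a patched model whose dict lists the main logits before the auxiliary ones; the 0.4 weight mirrors the code above.

```python
from typing import Any, Callable, Optional, Tuple


def split_inception_output(out: Any) -> Tuple[Any, Optional[Any]]:
    """Return (logits, aux_logits); aux_logits is None when absent.

    Handles a dict (assumed output of a dict-returning patch), a tuple or
    namedtuple (training-mode output), or anything else, such as the bare
    logits tensor returned in eval mode.
    """
    if isinstance(out, dict):
        # Relies on insertion order: main logits first, auxiliary second
        values = list(out.values())
        return values[0], (values[1] if len(values) > 1 else None)
    if isinstance(out, tuple):
        return out[0], (out[1] if len(out) > 1 else None)
    # Not a container: treat the whole object as the main logits
    return out, None


def compute_loss(model_output: Any, labels: Any,
                 criterion: Callable[[Any, Any], Any],
                 aux_weight: float = 0.4) -> Tuple[Any, Any]:
    """Main loss, plus the weighted auxiliary loss when aux logits exist."""
    logits, aux = split_inception_output(model_output)
    loss = criterion(logits, labels)
    if aux is not None:
        loss = loss + aux_weight * criterion(aux, labels)
    return logits, loss
```

In a training loop this would read `outputs, loss = compute_loss(model(inputs), labels, criterion)` in both phases, so no branch on `phase` is needed just to unpack the output.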

soumendukrg commented, Nov 19, 2019 (1 reaction)

Yes, I did add .values(), but I was assigning the result to a single output instead of unpacking it into output, aux_output, so the loss function was computed on a dict instead of a tensor and I got the error.

Thanks, your method saved me hours of training time. Earlier I could only train Inception on a single GPU; now, after modifying the PyTorch file with your code, I am able to train on more than one GPU.
