inception_v3 of vision 0.3.0 does not fit in DataParallel of torch 1.1.0

Environment: Python 3.5, torch 1.1.0, torchvision 0.3.0

Reproducible example:

    import torch
    import torchvision

    model = torchvision.models.inception_v3().cuda()
    model = torch.nn.DataParallel(model, [0, 1])
    x = torch.rand((8, 3, 299, 299)).cuda()
    model.forward(x)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "env/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.gather(outputs, self.output_device)
  File "/env/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/env/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "env/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: __new__() missing 1 required positional argument: 'aux_logits'

I guess the error occurs because the output of inception_v3 was changed from tuple to namedtuple.
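
The guess above can be checked without a GPU. What follows is a minimal sketch in plain Python, not the actual PyTorch code: `InceptionOutputs`, `gather_like`, and `gather_fixed` are hypothetical names, and integer sums stand in for concatenating per-GPU tensors. It mirrors the last traceback line, `type(out)(map(...))`, which works for a plain tuple but fails for a namedtuple because a namedtuple constructor expects one positional argument per field rather than a single iterable.

```python
from collections import namedtuple

# Hypothetical stand-in for the namedtuple returned by inception_v3 in training mode.
InceptionOutputs = namedtuple("InceptionOutputs", ["logits", "aux_logits"])


def gather_like(outputs):
    """Mimic the failing line: rebuild the container type from a map object."""
    out = outputs[0]
    # Plain addition stands in for concatenating per-GPU tensors.
    return type(out)(map(sum, zip(*outputs)))


def gather_fixed(outputs):
    """Same idea, but use _make for namedtuples, which accepts an iterable."""
    out = outputs[0]
    merged = map(sum, zip(*outputs))
    if hasattr(out, "_make"):
        return type(out)._make(merged)
    return type(out)(merged)


plain = [(1, 2), (3, 4)]
named = [InceptionOutputs(1, 2), InceptionOutputs(3, 4)]

print(gather_like(plain))  # a plain tuple is rebuilt without trouble
try:
    gather_like(named)
except TypeError as err:
    print("TypeError:", err)  # the message names the missing 'aux_logits' field
print(gather_fixed(named))
```

`gather_fixed` shows only one possible way to handle namedtuples; it is an illustration, not necessarily how PyTorch itself addresses the problem.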

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
sanka4rea commented, Aug 13, 2020

I tried out your solution @YongWookHa, but got an error as shown below:

train Loss: 0.9664 Acc: 0.5738

Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/torch0721/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-01e31a117c9f>", line 153, in <module>
    num_epochs=25, is_inception=True)
  File "<ipython-input-2-01e31a117c9f>", line 91, in train_model
    outputs, aux_outputs = model(inputs).values()
RuntimeError: Could not run 'aten::values' with arguments from the 'CUDA' backend. 'aten::values' is only available for these backends: [SparseCPU, SparseCUDA, Autograd, Profiler, Tracer].

Could you please give me some suggestions?

Edit: fixed. As there is no need to use the aux classifiers for inference, I changed the code to:

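if phase == 'train':
    # Training: the model returns both the main and the auxiliary outputs.
    outputs, aux_outputs = model(inputs).values()
    loss1 = criterion(outputs, labels)
    loss2 = criterion(aux_outputs, labels)
    loss = loss1 + 0.4 * loss2
else:
    # Evaluation: the model returns only the main output, a plain tensor.
    outputs = model(inputs)
    loss = criterion(outputs, labels)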

Thanks!
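
The train/eval branching above could also be folded into a small helper, so that a stray `.values()` call never reaches a plain tensor. This is only a sketch under stated assumptions: `unpack_outputs` and `combined_loss` are hypothetical names, and the model output is assumed to be a dict (as with the workaround discussed in this thread), a tuple or namedtuple, or a single tensor. The 0.4 weight comes from the snippet above.

```python
def unpack_outputs(result):
    """Return (main, aux); aux is None when the model gave a single output."""
    if isinstance(result, dict):
        # Assumes the dict holds exactly two entries: main logits, then aux logits.
        main, aux = result.values()
        return main, aux
    if isinstance(result, tuple):
        # Covers plain tuples and namedtuples alike.
        main, aux = result
        return main, aux
    return result, None


def combined_loss(criterion, result, labels, aux_weight=0.4):
    """Main loss plus a weighted auxiliary loss when an aux output is present."""
    main, aux = unpack_outputs(result)
    loss = criterion(main, labels)
    if aux is not None:
        loss = loss + aux_weight * criterion(aux, labels)
    return loss
```

In a training loop this might be called as `loss = combined_loss(criterion, model(inputs), labels)` in both phases, since the auxiliary term is skipped whenever the model returns a single output.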

1 reaction
soumendukrg commented, Nov 19, 2019

Yes, I did add values, but I was assigning the result to a single output instead of unpacking it into output, aux_output. As a result, the loss function was computed on a dict instead of a tensor, and I got the error.

Thanks, your method saved me hours of training time. Earlier, I could train Inception only on a single GPU; now, having modified the PyTorch file using your code, I am able to train on more than one GPU.
