
Data parallel error with O2 and not O1

See original GitHub issue

When using O2, data parallel does not work:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

However, with O1 everything works just fine.

# Imports implied by the snippet (torch, apex amp); GeneralVae, customLoss,
# generate_data_loader, etc. are the reporter's own code.
import torch
import torch.nn as nn
import torch.optim as optim
from apex import amp

model = GeneralVae(encoder, decoder, rep_size=500).cuda()
optimizer = optim.Adam(model.parameters(), lr=LR)
# amp.initialize with O2 converts the model before it is wrapped in DataParallel below
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
if data_para and torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
    model = model.cuda()

loss_picture = customLoss()

val_losses = []
train_losses = []

def train(epoch):
    train_loader_food = generate_data_loader(train_root, get_batch_size(epoch), int(rampDataSize * data_size))
    print("Epoch {}: batch_size {}".format(epoch, get_batch_size(epoch)))
    model.train()
    train_loss = 0
    loss = None
    for batch_idx, (data, _, aff) in enumerate(train_loader_food):
        data = data[0].cuda(0)
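
Per the report above, the only change needed to get a working run is the opt_level string passed to amp.initialize. A minimal sketch of that O1 variant, reusing the model and optimizer from the snippet and assuming the rest of the setup stays unchanged:

# O1 patches torch functions to cast inputs on the fly instead of converting the
# model itself to half precision; the reporter says this level works with DataParallel.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
if data_para and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()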

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 19
  • Comments: 32 (4 by maintainers)

Top GitHub Comments

12 reactions
iariav commented, Aug 11, 2019

@mcarilli Still seeing this issue. Any idea when the support for O2 + DataParallel will kick in?

Thanks.

12 reactions
mcarilli commented, Apr 24, 2019

Historically we only test with DistributedDataParallel because performance tends to be better, but the dataset sharing issue raised by @seongwook-ham in https://github.com/NVIDIA/apex/issues/269 is a compelling use case. @ptrblck and I will look into it. Current to-do list is better fused optimizers, checkpointing, sparse gradients, and then DataParallel, so it may be a couple weeks before I can give it undivided attention.
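
Since the maintainer points to DistributedDataParallel as the tested path, here is a rough sketch of that pattern with apex, one process per GPU launched via torch.distributed.launch. The model and optimizer names are taken from the reporter's snippet, and the exact flags follow the apex examples rather than anything stated in this issue:

# Sketch of the DistributedDataParallel path the maintainers test with apex.
# Launch with: python -m torch.distributed.launch --nproc_per_node=<num_gpus> train.py
import argparse
import torch
import torch.distributed as dist
import torch.optim as optim
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # one process per GPU
dist.init_process_group(backend='nccl', init_method='env://')

model = GeneralVae(encoder, decoder, rep_size=500).cuda()  # reporter's model
optimizer = optim.Adam(model.parameters(), lr=LR)

# Initialize amp first, then wrap the model in DDP, per the apex examples.
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
model = DDP(model)

Each process would also need to load its own shard of the data (typically via torch.utils.data.distributed.DistributedSampler), which is the dataset-sharing concern @seongwook-ham raised in https://github.com/NVIDIA/apex/issues/269.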

