
Data parallel error with O2 and not O1

See original GitHub issue

When using O2, data parallel does not work:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

However, with O1 everything works just fine.

# Imports implied by the snippet (torch, apex amp); GeneralVae, customLoss,
# generate_data_loader, etc. are the reporter's own code.
import torch
import torch.nn as nn
import torch.optim as optim
from apex import amp

model = GeneralVae(encoder, decoder, rep_size=500).cuda()
optimizer = optim.Adam(model.parameters(), lr=LR)
# amp.initialize with O2 converts the model before it is wrapped in DataParallel below
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
if data_para and torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
    model = model.cuda()

loss_picture = customLoss()

val_losses = []
train_losses = []

def train(epoch):
    train_loader_food = generate_data_loader(train_root, get_batch_size(epoch), int(rampDataSize * data_size))
    print("Epoch {}: batch_size {}".format(epoch, get_batch_size(epoch)))
    model.train()
    train_loss = 0
    loss = None
    for batch_idx, (data, _, aff) in enumerate(train_loader_food):
        data = data[0].cuda(0)
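
Per the report above, the only change needed to get a working run is the opt_level string passed to amp.initialize. A minimal sketch of that O1 variant, reusing the model and optimizer from the snippet and assuming the rest of the setup stays unchanged:

# O1 patches torch functions to cast inputs on the fly instead of converting the
# model itself to half precision; the reporter says this level works with DataParallel.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
if data_para and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()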

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 19
  • Comments: 32 (4 by maintainers)

Top GitHub Comments

12 reactions
iariav commented, Aug 11, 2019

@mcarilli Still seeing this issue. Any idea when the support for O2 + DataParallel will kick in?

Thanks.

12 reactions
mcarilli commented, Apr 24, 2019

Historically we only test with DistributedDataParallel because performance tends to be better, but the dataset sharing issue raised by @seongwook-ham in https://github.com/NVIDIA/apex/issues/269 is a compelling use case. @ptrblck and I will look into it. Current to-do list is better fused optimizers, checkpointing, sparse gradients, and then DataParallel, so it may be a couple weeks before I can give it undivided attention.
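
Since the maintainer points to DistributedDataParallel as the tested path, here is a rough sketch of that pattern with apex, one process per GPU launched via torch.distributed.launch. The model and optimizer names are taken from the reporter's snippet, and the exact flags follow the apex examples rather than anything stated in this issue:

# Sketch of the DistributedDataParallel path the maintainers test with apex.
# Launch with: python -m torch.distributed.launch --nproc_per_node=<num_gpus> train.py
import argparse
import torch
import torch.distributed as dist
import torch.optim as optim
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # one process per GPU
dist.init_process_group(backend='nccl', init_method='env://')

model = GeneralVae(encoder, decoder, rep_size=500).cuda()  # reporter's model
optimizer = optim.Adam(model.parameters(), lr=LR)

# Initialize amp first, then wrap the model in DDP, per the apex examples.
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
model = DDP(model)

Each process would also need to load its own shard of the data (typically via torch.utils.data.distributed.DistributedSampler), which is the dataset-sharing concern @seongwook-ham raised in https://github.com/NVIDIA/apex/issues/269.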

