Data parallel error with O2 and not O1
When using O2, DataParallel does not work: RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
However, with O1, everything works just fine.
model = GeneralVae(encoder, decoder, rep_size=500).cuda()
optimizer = optim.Adam(model.parameters(), lr=LR)
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')

if data_para and torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
model = model.cuda()

loss_picture = customLoss()
val_losses = []
train_losses = []

def train(epoch):
    train_loader_food = generate_data_loader(train_root, get_batch_size(epoch), int(rampDataSize * data_size))
    print("Epoch {}: batch_size {}".format(epoch, get_batch_size(epoch)))
    model.train()
    train_loss = 0
    loss = None
    for batch_idx, (data, _, aff) in enumerate(train_loader_food):
        data = data[0].cuda(0)
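The error above says the input sits on device 1 while a weight sits on device 0. One way to narrow this down is to list which device each tensor lives on just before the failing call. The following is only a debugging sketch: the helper names device_report and assert_single_device are hypothetical and not part of apex or PyTorch, and the helpers rely only on each object exposing a .device attribute, as torch tensors and parameters do.

```python
from collections import defaultdict


def device_report(named_tensors):
    """Group tensor names by the string form of their .device attribute.

    `named_tensors` is any iterable of (name, tensor) pairs, for example
    model.named_parameters() in PyTorch.
    """
    groups = defaultdict(list)
    for name, tensor in named_tensors:
        groups[str(tensor.device)].append(name)
    return dict(groups)


def assert_single_device(named_tensors):
    """Raise if the tensors span more than one device, else return that device."""
    report = device_report(named_tensors)
    if len(report) > 1:
        raise RuntimeError("tensors span several devices: {}".format(report))
    # An empty input has no device at all, so return None in that case.
    return next(iter(report), None)
```

For instance, calling assert_single_device(list(self.named_parameters()) + [("input", x)]) at the top of the module's forward (where x is the forward argument) should show whether a replica running on device 1 is still holding weights from device 0.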
Issue Analytics
- State:
- Created 4 years ago
- Reactions:19
- Comments:32 (4 by maintainers)
@mcarilli I'm still seeing this issue. Any idea when support for O2 + DataParallel will land?
Thanks.
Historically we only test with DistributedDataParallel because performance tends to be better, but the dataset sharing issue raised by @seongwook-ham in https://github.com/NVIDIA/apex/issues/269 is a compelling use case. @ptrblck and I will look into it. Current to-do list is better fused optimizers, checkpointing, sparse gradients, and then DataParallel, so it may be a couple weeks before I can give it undivided attention.
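For readers who can switch in the meantime, the DistributedDataParallel setup mentioned above might look roughly like the sketch below. It is not taken from this thread. The helper names local_rank_from_env and setup_amp_ddp are hypothetical, and the sketch assumes one process per GPU, started by a launcher that exports LOCAL_RANK (as torchrun does) together with the usual rendezvous variables. It uses PyTorch's own DistributedDataParallel wrapper and builds the optimizer only after the model has been moved to its GPU.

```python
import os


def local_rank_from_env(default=0):
    """Return the per-process GPU index exported by the launcher as LOCAL_RANK."""
    return int(os.environ.get("LOCAL_RANK", default))


def setup_amp_ddp(model, make_optimizer, opt_level="O2"):
    """Move the model to this process's GPU, apply amp, then wrap it in DDP.

    `make_optimizer` is a callable such as
    lambda params: optim.Adam(params, lr=LR).
    """
    # Imported here so local_rank_from_env stays usable without a GPU stack.
    import torch
    import torch.distributed as dist
    from apex import amp
    from torch.nn.parallel import DistributedDataParallel as DDP

    local_rank = local_rank_from_env()
    torch.cuda.set_device(local_rank)
    # With no init_method given, rendezvous settings are read from the environment.
    dist.init_process_group(backend="nccl")

    model = model.cuda(local_rank)
    optimizer = make_optimizer(model.parameters())
    # amp.initialize is called before the parallel wrapper, as in the snippet above.
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)
    return model, optimizer
```

Each process then works on its own shard of the data, for example through torch.utils.data.distributed.DistributedSampler, rather than one process scattering each batch across GPUs as DataParallel does.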