
DataParallelModel seems to be stuck

See original GitHub issue

Hello, @zhanghang1989.

I am working with your code to train a PSPNet50 on Pascal VOC. However, the training process got stuck at this line. After closer inspection, I found it was stuck at layer3 in base_forward. The following is my training script.

#!/bin/bash

CUDA_VISIBLE_DEVICES=0,3 python train.py \
  --dataset pascal_voc \
  --model psp \
  --backbone resnet50 \
  --batch-size 8 \
  --aux

I am still looking into this problem. Since I found no similar issues, I posted it here. If you have any idea about this issue, could you please kindly share it with me? Thank you.
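
For reference, here is a minimal check (my own sketch, not code from the repo) that runs a plain torch.nn.DataParallel forward pass on the same two visible GPUs; if this also hangs, the problem is likely in the multi-GPU environment (P2P, driver, resources) rather than in DataParallelModel itself. resnet50 stands in for the real PSPNet50 here.

# Minimal sketch: does a plain nn.DataParallel forward pass hang on the same GPUs?
# resnet50 is a stand-in for PSPNet50; run with the same CUDA_VISIBLE_DEVICES=0,3
# so that device_ids map to (0, 1) inside the process.
import torch
import torch.nn as nn
from torchvision.models import resnet50

def check_plain_dataparallel(device_ids=(0, 1), batch_size=8):
    model = resnet50().cuda(device_ids[0])
    model = nn.DataParallel(model, device_ids=list(device_ids))
    x = torch.randn(batch_size, 3, 224, 224).cuda(device_ids[0])
    with torch.no_grad():
        out = model(x)  # if this hangs too, the issue is not specific to encoding
    torch.cuda.synchronize()
    print("plain DataParallel forward finished:", out.shape)

if __name__ == "__main__":
    check_plain_dataparallel()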

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 13 (2 by maintainers)

Top GitHub Comments

4 reactions
jianchao-li commented, Sep 26, 2018

Hi, @zhanghang1989, @xiongzhanblake.

I found this reply, which seems relevant. I came across this problem when using 4 Tesla V100-PCIE-16GB GPUs with CUDA 9.0, and I also hit it when using 4 GeForce GTX 1080 Ti GPUs.

By the way, I also ran into what is probably the same deadlock with multi-GPU training in gluon-cv, so I think this problem may not be due to the code itself but to some other factor.

As mentioned in the comment above, I switched from PyTorch-Encoding to gluon-cv to train PSPNet with SyncBatchNorm, and the gluon-cv training code also got stuck. After several days of investigation, I finally found a way to get rid of this problem.

I was running the experiment on a GPU cluster, which is like a cloud environment where you can set up a virtual machine with a specified amount of resources (CPUs, GPUs, memory, etc.). The hang happened when I was using 4 GPUs, while using 2 GPUs worked fine. Previously, the VM had 5 CPUs and 10 GB of memory by default, and training on 4 GPUs got stuck. This morning I increased the CPUs to 8 and the memory to 20 GB, and it worked! Now I am happily training PSPNet on 4 GPUs with SyncBatchNorm in gluon-cv.
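
One possible (unconfirmed) interaction is the data-loading workers oversubscribing the few CPUs granted to the VM. A rough sketch of capping num_workers to the CPUs actually available follows; the dataset below is only a placeholder.

# Hedged sketch: cap DataLoader workers to the CPUs the process may actually use.
# This illustrates one *possible* way the CPU/memory settings could interact with
# 4-GPU training; the real mechanism was never pinned down in this thread.
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the real Pascal VOC / Cityscapes loader.
dataset = TensorDataset(torch.randn(64, 3, 480, 480),
                        torch.zeros(64, dtype=torch.long))

# sched_getaffinity reflects the CPUs this process is allowed to use (Linux);
# fall back to os.cpu_count() elsewhere.
cpus = len(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else (os.cpu_count() or 1)
num_gpus = max(1, torch.cuda.device_count())
workers = max(1, min(cpus - 1, 2 * num_gpus))

loader = DataLoader(dataset, batch_size=8, shuffle=True,
                    num_workers=workers, pin_memory=True)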

I think the hang was due to the resource settings, though I still have no concrete idea of how they interfere with training and lead to the hang. Since the problem is resolved for now, I will close this issue. Thank you for the discussion.

1 reaction
xiongzhanblake commented, Sep 24, 2018

Hi, @jianchao-li. I tried p2pBandwidthLatencyTest as you suggested and found that my GPUs can communicate with each other, so you might have to figure out that problem on your side first.

I then found the reason for the deadlock in my multi-GPU training: it turns out that syncbn does not surface GPU out-of-memory errors, @zhanghang1989. The evidence is that when I changed encoding.nn.BatchNorm2d back to the original torch.nn.BatchNorm2d, the training process reported a GPU out-of-memory error, while with encoding.nn.BatchNorm2d no error popped up and a deadlock happened instead. Finally, I chose a smaller batch size and no deadlock occurred anymore. Now I can train my network on Cityscapes with encoding pretty well!

I hope my solution can help you as well! Cheers!
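
A rough sketch of the two checks described above: torch.cuda.can_device_access_peer is standard PyTorch, but the build_model constructor is only an illustration, not the actual PyTorch-Encoding API (the real PSPNet wires its norm layer itself).

# 1) Peer-to-peer access check from Python (a lightweight stand-in for the CUDA
#    samples' p2pBandwidthLatencyTest).
import torch
import torch.nn as nn

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"GPU {i} -> GPU {j} P2P:", torch.cuda.can_device_access_peer(i, j))

# 2) Swap the sync BN for the stock torch.nn.BatchNorm2d to surface a hidden OOM.
#    If the run now crashes with "CUDA out of memory" instead of hanging, the
#    "deadlock" was an unreported OOM, and a smaller batch size is the fix.
def build_model(norm_layer=nn.BatchNorm2d):
    # hypothetical constructor for illustration only
    return nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), norm_layer(64), nn.ReLU())

debug_model = build_model(norm_layer=nn.BatchNorm2d)
# real_model = build_model(norm_layer=encoding.nn.BatchNorm2d)  # sync BN for the actual run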

Read more comments on GitHub >

Top Results From Across the Web

nn.DataParallel gets stuck - PyTorch Forums
I'm trying to train a model on multiGPU using nn.DataParallel and the program gets stuck. (in the sense I can't even ctrl+c to...
Read more >
PyTorch 0.4 hangs with nn.DataParallel but PyTorch 0.3.1 ...
Issue description The snippet below hangs with PyTorch 0.4 but successfully finishes with PyTorch 0.3.1. I found that removing model = nn.
Read more >
Pytorch DataParallel doesn't work when the model contain ...
I have no experience with DataParallel , but I think it might be because your tensor is not part of the model parameters....
Read more >
Notes on parallel/distributed training in PyTorch - Kaggle
An implementation detail is that the DataParallel module is frozen. This is because it defines additional object properties which may collide with custom...
Read more >
Fully Sharded Data Parallel: faster AI training with fewer GPUs
It shards an AI model's parameters across data parallel workers and can optionally offload part of ... Thanks for sticking with us thus...
Read more >
