DataParallelModel seems to be stuck
Hello, @zhanghang1989.
I am working with your code to train a PSPNet50 on Pascal VOC. However, the training process was stuck at this line. After a closer inspection, I found it was stuck at `layer3` in `base_forward`. The following is my training script.
```bash
#!/bin/bash
CUDA_VISIBLE_DEVICES=0,3 python train.py \
    --dataset pascal_voc \
    --model psp \
    --backbone resnet50 \
    --batch-size 8 \
    --aux
```
I am still looking into this problem. Since I found no similar issues, I posted it here. If you have any idea about this issue, could you please kindly share it with me? Thank you.
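In case it helps anyone debugging a similar hang: a quick way to see exactly where the forward pass is blocked, instead of guessing from print statements, is to dump the stack of every thread after a timeout. This is a generic diagnostic sketch using the standard-library `faulthandler` module, not code from PyTorch-Encoding; the 120-second timeout is an arbitrary choice.

```python
import faulthandler
import sys

# If the process is still running after 120 seconds, dump the Python stack
# traces of all threads to stderr, and repeat every 120 seconds thereafter.
# When the run hangs inside DataParallelModel, the dump shows which call
# (e.g. layer3 in base_forward) each thread is blocked on.
faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr)

# ... build the model and dataloader, then run the usual training loop ...
```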
Issue Analytics
- Created 5 years ago
- Comments: 13 (2 by maintainers)
Hi, @zhanghang1989, @xiongzhanblake.

As said in the above comment, I switched from PyTorch-Encoding to gluon-cv for training PSPNet with `SyncBatchNorm`, and the gluon-cv training code also got stuck. After several days of investigation, I finally found a way to get rid of this problem.

I was running the experiment on a GPU cluster, which works like a cloud environment where you set up a virtual machine with a specified amount of resources (CPUs, GPUs, memory, etc.). The hang happened when I was using 4 GPUs, while using 2 GPUs worked fine. By default the job had 5 CPUs and 10 GB of memory, and training on 4 GPUs got stuck. This morning I raised the CPU count to 8 and the memory to 20 GB, and it worked! Now I am happily training PSPNet on 4 GPUs with `SyncBatchNorm` in gluon-cv.

I think the hang was caused by the resource settings, though I still have no concrete idea of how they interfere with training and lead to the hang. But as the problem is resolved for now, I will close this issue. Thank you for the discussion.
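I can't say for certain why the CPU and memory quota mattered, but one plausible factor is that the data-loading worker processes all compete for whatever CPUs the cluster actually grants the job, which can be far fewer than the machine's total. Below is a minimal, generic sketch (not code from PyTorch-Encoding or gluon-cv, and Linux-specific because of `sched_getaffinity`) for checking the allotted CPUs before picking the number of loader workers; how the worker count is passed to the training script is an assumption for illustration.

```python
import os

import torch

# CPUs the scheduler actually allotted to this job -- on a shared cluster
# this can be much smaller than the number of physical cores on the node.
available_cpus = len(os.sched_getaffinity(0))
num_gpus = torch.cuda.device_count()

# Hypothetical choice: leave one CPU for the main training process and use
# the rest for DataLoader workers, instead of hard-coding a large value on
# a job that was only granted a handful of CPUs.
num_workers = max(1, available_cpus - 1)
print(f"GPUs: {num_gpus}, CPUs granted: {available_cpus}, "
      f"using num_workers={num_workers} for the DataLoader")
```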
Hi, @jianchao-li. I tried p2pBandwidthLatencyTest as you suggested and found that my GPUs can communicate with each other, so you might have to figure out that problem first. I then found the reason for the deadlock in my multi-GPU training: it turns out that the sync BN in `encoding.nn.BatchNorm2d` does not catch the GPU out-of-memory error @zhanghang1989. The evidence is that when I switched `encoding.nn.BatchNorm2d` back to the original `torch.nn.BatchNorm2d`, the training process reported a GPU out-of-memory error, and when I switched back to `encoding.nn.BatchNorm2d`, no error popped up and a deadlock happened instead. Finally, I chose a smaller batch size and no deadlock occurred anymore. Now I can train my network on Cityscapes with encoding pretty well!

I hope my solution can help you as well! Cheers!
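For anyone who wants to rule out the GPU-communication angle without building the CUDA samples, a rough first check can be done from Python. This sketch only reports whether peer-to-peer access is *possible* between each pair of devices; the p2pBandwidthLatencyTest sample mentioned above remains the more thorough test.

```python
import torch

# Quick peer-to-peer visibility check between all visible GPU pairs.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access "
                  f"{'OK' if ok else 'NOT available'}")
```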