Problem with multi-GPU training
Hello,
I have successfully built maskrcnn_benchmark on Ubuntu 16.04. My workstation has 4x 1080Ti (CUDA 9.2, cuDNN 7, Nvidia driver 410.48), and I tried to train on the COCO dataset on multiple GPUs. I used the script provided in the “Perform training on COCO dataset” section.
One GPU worked fine with:
python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1
Then I used
export NGPUS=2
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 4 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1
(the same train_net.py and the same config, with images per batch changed to 4), and everything worked fine.
Next I tried the same thing with 3 GPUs (NGPUS=3, images per batch 6), and the training gets stuck during the first iteration. The log stops at the following output and never changes:
2018-10-29 17:20:55,722 maskrcnn_benchmark.trainer INFO: Start training
2018-10-29 17:20:58,453 maskrcnn_benchmark.trainer INFO: eta: 22 days, 18:04:18 iter: 0 loss: 6.7175 (6.7175) loss_classifier: 4.4688 (4.4688) loss_box_reg: 0.0044 (0.0044) loss_mask: 1.4084 (1.4084) loss_objectness: 0.7262 (0.7262) loss_rpn_box_reg: 0.1097 (0.1097) time: 2.7304 (2.7304) data: 2.4296 (2.4296) lr: 0.000833 max mem: 1749
The GPU memory is used, the temperature goes up, but nothing is happening (I tried multiple times and then gave up).
Any ideas? I’d be grateful for help.
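A way to narrow a hang like this down (this is a suggestion, not something from the original report; the script name and code below are an illustrative sketch) is to check whether a bare torch.distributed all_reduce over NCCL already stalls with three or four processes, independent of maskrcnn_benchmark:

# nccl_check.py -- hypothetical helper script, not part of maskrcnn_benchmark
# Minimal sanity check: initialize the NCCL process group the same way
# torch.distributed.launch does, then do a single all_reduce per rank.
import argparse
import torch
import torch.distributed as dist

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each spawned process
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    # If this all_reduce blocks, the problem is in the NCCL / driver /
    # PyTorch stack rather than in the training code.
    t = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(t)
    print("rank {} ok, sum of ranks = {}".format(dist.get_rank(), t.item()))

Launched with python -m torch.distributed.launch --nproc_per_node=$NGPUS nccl_check.py, it should print one line per rank; if it hangs at 3 or 4 processes, the training code can be ruled out.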
Top GitHub Comments
Recently, I updated my PyTorch to v1.0.0 and it solved this problem.
Driver version: 415.27
CUDA version: cuda_9.2.148_396.37 + patch 1
cuDNN version: cudnn-9.2-linux-x64-v7.3.1
NCCL version: nccl_2.3.7-1+cuda9.2
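For anyone comparing their own build against this fix, PyTorch can report at runtime the versions it was actually built with; a minimal sketch using standard PyTorch introspection calls (nothing maskrcnn_benchmark-specific):

# Print the toolchain versions that mattered in this thread.
import torch
print("PyTorch:", torch.__version__)
print("CUDA:   ", torch.version.cuda)
print("cuDNN:  ", torch.backends.cudnn.version())
print("NCCL:   ", torch.cuda.nccl.version())
print("GPUs:   ", torch.cuda.device_count())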
I have the same problem too.
Environment:
Python: 3.5
GPU: 4x 1080Ti
CUDA: 9.0 (with all the patches)
cuDNN: 7.1
NCCL2: downloaded from Nvidia
Nvidia driver: 390, 396, 410
PyTorch: compiled from source (v1.0rc0 and v1.0rc1)
Ubuntu: 16.04
The bug is weird to me. If I use only two GPUs, everything is fine. If I try to use 4 GPUs, it sometimes occurs. P.S. I also found that when I use Nvidia driver 410, the frequency is much lower.
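For intermittent hangs like this, one thing worth trying (a suggestion, not something reported in this thread) is to turn on NCCL's own logging and, as a test, disable peer-to-peer transfers before launching:
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1
NCCL_DEBUG=INFO makes each rank print which transport it selects and where communication stalls, and NCCL_P2P_DISABLE=1 rules out the GPU-to-GPU peer-access path that multi-GPU setups sometimes trip over.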