
Problem with multi-GPU training

See original GitHub issue

Hello,

I have successfully built maskrcnn_benchmark on Ubuntu 16.04. My workstation has 4x 1080Ti GPUs (CUDA 9.2, cuDNN 7, Nvidia driver 410.48), and I tried to train on the COCO dataset on multiple GPUs, using the script provided in the “Perform training on COCO dataset” section.

One GPU worked fine with:

python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1

Then I used

export NGPUS=2
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 4 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1

(the same train_net.py file and the same config, with images per batch changed to 4), and everything worked fine.

Next I tried the same thing for 3 GPUs (NGPUS=3, images per batch 6) and the training gets stuck during the first iteration. I get the following logging output, and it never changes:

2018-10-29 17:20:55,722 maskrcnn_benchmark.trainer INFO: Start training
2018-10-29 17:20:58,453 maskrcnn_benchmark.trainer INFO: eta: 22 days, 18:04:18  iter: 0  loss: 6.7175 (6.7175)  loss_classifier: 4.4688 (4.4688)  loss_box_reg: 0.0044 (0.0044)  loss_mask: 1.4084 (1.4084)  loss_objectness: 0.7262 (0.7262)  loss_rpn_box_reg: 0.1097 (0.1097)  time: 2.7304 (2.7304)  data: 2.4296 (2.4296)  lr: 0.000833  max mem: 1749

The GPU memory is used, the temperature goes up, but nothing is happening (I tried multiple times and then gave up).
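For reference, the 3-GPU launch presumably followed the same pattern as the 2-GPU one above (the exact command is not quoted in the issue, so treat this as a reconstruction from the parameters stated):

export NGPUS=3
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 6 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1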

Any ideas? I’d be grateful for help.
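A common first diagnostic for a distributed hang like this, sketched here under the assumption of a standard PyTorch + NCCL setup (it is not taken from the thread itself), is to turn on NCCL's debug logging and inspect the GPU interconnect topology before relaunching the same training command:

export NCCL_DEBUG=INFO   # each rank prints NCCL setup and transport info to stderr
nvidia-smi topo -m       # shows the PCIe/NVLink connectivity between the four GPUs

A rank that initializes NCCL but never progresses past the first collective typically points at a peer-to-peer communication problem between a specific pair of GPUs, which would also explain why 2 GPUs work while 3 or 4 hang.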

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 23 (16 by maintainers)

Top GitHub Comments

1 reaction
chengyangfu commented on Jan 30, 2019

Recently, I updated my PyTorch to v1.0.0 and it solved this problem.

  • Driver version: 415.27
  • CUDA version: cuda_9.2.148_396.37 + patch 1
  • cuDNN version: cudnn-9.2-linux-x64-v7.3.1
  • NCCL version: nccl_2.3.7-1+cuda9.2
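To confirm what a given build is actually using, the versions can be read from Python. A quick sketch; these accessors exist in PyTorch 1.0-era builds, though the exact values reported may vary by build:

import torch

print(torch.__version__)               # PyTorch version, e.g. 1.0.0
print(torch.version.cuda)              # CUDA toolkit PyTorch was built against
print(torch.backends.cudnn.version())  # cuDNN version as an integer, e.g. 7301
print(torch.cuda.nccl.version())       # NCCL version PyTorch was linked with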

1 reaction
chengyangfu commented on Nov 1, 2018

I have the same problem too.

Environment:

  • Python: 3.5
  • GPU: 4x 1080Ti
  • CUDA: 9.0 (with all the patches)
  • cuDNN: 7.1
  • NCCL2: downloaded from Nvidia
  • Nvidia driver: 390, 396, 410
  • PyTorch: compiled from source (v1.0rc0 and v1.0rc1)
  • Ubuntu: 16.04

The bug is weird to me. If I use only two GPUs, everything is fine. If I try to use 4 GPUs, it sometimes occurs. P.S. I also found that with Nvidia driver 410 the hang occurs much less frequently.
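To separate a maskrcnn_benchmark bug from a plain NCCL/driver one, a minimal all_reduce script launched the same way makes a useful smoke test: if it also hangs at 3 or 4 GPUs, the training code is not at fault. A sketch, assuming the same torch.distributed.launch invocation as above; the file name check_allreduce.py is made up here:

# check_allreduce.py - minimal NCCL smoke test (hypothetical helper, not from the thread)
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# Each rank contributes its rank id; the reduced value must be 0 + 1 + ... + (world_size - 1).
t = torch.full((1,), dist.get_rank(), device="cuda", dtype=torch.float32)
dist.all_reduce(t)
print("rank {}: all_reduce sum = {}".format(dist.get_rank(), t.item()))

Launched, for example, with: python -m torch.distributed.launch --nproc_per_node=4 check_allreduce.py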

Read more comments on GitHub >

Top Results From Across the Web

  • A problem when using “multi-gpu” as ...
    I am experiencing weird problems when I use the “multi-gpu” as the “ExecutionEnvironment” in the training option for training a CNN.
  • 13.5. Training on Multiple GPUs
    GPU memory used to be a problem in the early days of deep learning. By now this issue has been resolved for all...
  • Multi GPU training is stuck · Issue #9242 · Lightning-AI ...
    Bug To Reproduce I am experiencing the same problem. Except that it does not work in 'dp' ... I ran the cifar-10 example...
  • Multi GPU Model Training: Monitoring and Optimizing
    In this article, we will discuss multi GPU training with PyTorch Lightning and find out the best practices that should be adopted to ...
  • A gotcha with multi-GPU training of dynamic neural networks ...
    I recently ran into an issue with training/testing dynamic neural network architectures on multiple GPUs in PyTorch.
