Multi-GPU training is unstable?
If you do not know the root cause of the problem / bug, and wish someone to help you, please include:
To Reproduce
- what changes you made / what code you wrote
In tools/train_net.py, I added the new dataset at the beginning of the main function:
```python
from detectron2.data.datasets import register_coco_instances

def main(args):
    register_coco_instances("moda", {}, "moda.json", "datasets/moda/images")
```
You can download moda.json here. You can also download a partial set of the moda images here; the full images are here (not recommended due to the large size).
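For reference, here is a quick sanity check on the registered dataset (my own sketch, not part of the original report; it assumes the standard COCO-style XYWH boxes that register_coco_instances produces):

```python
from detectron2.data import DatasetCatalog
from detectron2.data.datasets import register_coco_instances

# Sanity-check sketch: count classes and look for degenerate boxes in the "moda" dataset.
register_coco_instances("moda", {}, "moda.json", "datasets/moda/images")
dataset_dicts = DatasetCatalog.get("moda")

class_ids = set()
bad_boxes = 0
for record in dataset_dicts:
    for ann in record.get("annotations", []):
        class_ids.add(ann["category_id"])
        x, y, w, h = ann["bbox"]  # COCO-registered datasets use XYWH_ABS boxes
        if w <= 0 or h <= 0:
            bad_boxes += 1

print(f"{len(dataset_dicts)} images, {len(class_ids)} classes, {bad_boxes} degenerate boxes")
```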
My configs/modanet.yaml is as follows:
```yaml
_BASE_: "./Base-RCNN-FPN.yaml"
MODEL:
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  # WEIGHTS: "detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl"  # initialize from model zoo
  MASK_ON: True
  RESNETS:
    DEPTH: 50
  ROI_HEADS:
    NUM_CLASSES: 13
DATASETS:
  TRAIN: ("moda",)
  TEST: ()
DATALOADER:
  ASPECT_RATIO_GROUPING: False
  # NUM_WORKERS: 4
SOLVER:
  IMS_PER_BATCH: 20
  BASE_LR: 0.01
  STEPS: (60000, 80000)
  MAX_ITER: 90000
```
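One note on the solver settings (my own observation, not part of the original report): in detectron2, SOLVER.IMS_PER_BATCH is the total number of images per iteration across all GPUs, so with --num-gpus 4 each worker process handles a slice of it:

```python
# IMS_PER_BATCH is the global batch size; detectron2 splits it across worker processes.
ims_per_batch = 20
num_gpus = 4
print(f"~{ims_per_batch // num_gpus} images per GPU per iteration")  # ~5
```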
- what command you run
python tools/train_net.py --num-gpus 4 --config-file configs/modanet.yaml
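For context, this is roughly what the command does under the hood: detectron2's launch() spawns one worker process per GPU and calls main(args) in each. The sketch below is based on the stock tools/train_net.py entry point; treat the details as an approximation of that script, not a verbatim copy.

```python
from detectron2.engine import default_argument_parser, launch


def main(args):
    # In the real script this registers the dataset, builds the trainer and trains;
    # see the snippet at the top of this issue.
    pass


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    launch(
        main,
        args.num_gpus,               # 4 -> four worker processes, one per GPU
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )
```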
- what you observed (full logs are preferred)
When I use a single GPU, it always works fine. But when I try to use multiple GPUs, several bugs occur randomly; only rarely does multi-GPU training work fine. What's wrong with it?
The first bug is about Box2BoxTransform. When I debugged it, the anchor width was less than 0.
[10/14 15:39:39 d2.engine.train_loop]: Starting training from iteration 0
Traceback (most recent call last):
File "modanet.py", line 162, in <module>
args=(args,),
File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 49, in launch
daemon=False,
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.
py", line 171, in spawn
while not spawn_context.join():
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.
py", line 118, in join
raise Exception(msg)
Exception:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.
py", line 19, in _wrap
fn(i, *args)
File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
main_func(*args)
File "/SSD/hyunsu/detectron2/modanet.py", line 146, in main
return trainer.train()
File "/SSD/hyunsu/detectron2/detectron2/engine/defaults.py", line 329, in train
super().train(self.start_iter, self.max_iter)
File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
loss_dict = self.model(data)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py",
line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distribute
d.py", line 442, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py",
line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py",
line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 161, in forward
losses = {k: v * self.loss_weight for k, v in outputs.losses().items()}
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 316, in l
osses
gt_objectness_logits, gt_anchor_deltas = self._get_ground_truth()
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 283, in _
get_ground_truth
anchors_i.tensor, matched_gt_boxes.tensor
File "/SSD/hyunsu/detectron2/detectron2/modeling/box_regression.py", line 63, in get_deltas
assert (src_widths > 0).all().item(), "Input boxes to Box2BoxTransform are not valid!"
AssertionError: Input boxes to Box2BoxTransform are not valid!
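For reference, here is my own simplified illustration of the check that fails here (detectron2's get_deltas asserts that every source box it receives has a positive width):

```python
import torch

# Simplified illustration of the failing check (not detectron2's exact code).
# Boxes are in XYXY format, so width = x2 - x1; a width <= 0 means the box is
# degenerate or the underlying tensor has been corrupted.
boxes = torch.tensor([[30.0, 10.0, 30.0, 50.0],   # x1 == x2 -> zero width (invalid)
                      [10.0, 10.0, 40.0, 50.0]])  # valid box
widths = boxes[:, 2] - boxes[:, 0]
print((widths > 0).tolist())  # [False, True] -- the assertion fires if any entry is False
```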
The second bug is as follows:
Traceback (most recent call last):
File "modanet.py", line 162, in <module>
args=(args,),
File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 49, in launch
daemon=False,
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
main_func(*args)
File "/SSD/hyunsu/detectron2/modanet.py", line 146, in main
return trainer.train()
File "/SSD/hyunsu/detectron2/detectron2/engine/defaults.py", line 329, in train
super().train(self.start_iter, self.max_iter)
File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
loss_dict = self.model(data)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 143, in forward
anchors = self.anchor_generator(features)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/anchor_generator.py", line 181, in forward
anchors_over_all_feature_maps = self.grid_anchors(grid_sizes)
File "/SSD/hyunsu/detectron2/detectron2/modeling/anchor_generator.py", line 124, in grid_anchors
shift_x, shift_y = _create_grid_offsets(size, stride, base_anchors.device)
File "/SSD/hyunsu/detectron2/detectron2/modeling/anchor_generator.py", line 43, in _create_grid_offsets
shifts_x = torch.arange(0, grid_width * stride, step=stride, dtype=torch.float32, device=device)
RuntimeError: tabulate: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
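Note (my addition, not from the original report): cudaErrorIllegalAddress is reported asynchronously, so the line shown above is not necessarily where the bad access actually happened. One way to get a more accurate traceback is to force synchronous kernel launches before anything touches the GPU:

```python
# Force synchronous CUDA kernel launches so errors surface at the real call site.
# This must be set before the CUDA context is created (i.e. before any GPU work).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```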
The third bug is as follows:
[10/14 15:42:59 d2.engine.train_loop]: Starting training from iteration 0
Traceback (most recent call last):
File "modanet.py", line 162, in <module>
args=(args,),
File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 49, in launch
daemon=False,
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
main_func(*args)
File "/SSD/hyunsu/detectron2/modanet.py", line 146, in main
return trainer.train()
File "/SSD/hyunsu/detectron2/detectron2/engine/defaults.py", line 329, in train
super().train(self.start_iter, self.max_iter)
File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
loss_dict = self.model(data)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 161, in forward
losses = {k: v * self.loss_weight for k, v in outputs.losses().items()}
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 316, in losses
gt_objectness_logits, gt_anchor_deltas = self._get_ground_truth()
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 268, in _get_ground_truth
matched_idxs, gt_objectness_logits_i = self.anchor_matcher(match_quality_matrix)
File "/SSD/hyunsu/detectron2/detectron2/modeling/matcher.py", line 78, in __call__
assert torch.all(match_quality_matrix >= 0)
AssertionError
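My own illustration of what this assertion expects (not from the issue): the match-quality matrix the RPN passes to the matcher is a pairwise IoU matrix, which should always be non-negative for well-formed boxes; negative or NaN entries point to corrupted inputs rather than a normal geometric configuration.

```python
import torch
from detectron2.structures import Boxes, pairwise_iou

# The matcher receives a (num_gt x num_anchors) IoU matrix with entries in [0, 1].
gt_boxes = Boxes(torch.tensor([[10.0, 10.0, 50.0, 50.0]]))
anchors = Boxes(torch.tensor([[0.0, 0.0, 40.0, 40.0],
                              [100.0, 100.0, 120.0, 120.0]]))
match_quality_matrix = pairwise_iou(gt_boxes, anchors)
print(torch.all(match_quality_matrix >= 0).item())  # True for well-formed boxes
```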
When it does work, the memory usage across GPUs is unbalanced, as follows:
(detectron2) root@b06e1b5c1ffb:/SSD/hyunsu/detectron2# nvidia-smi
Sun Oct 13 16:49:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp On | 00000000:83:00.0 Off | N/A |
| 39% 65C P2 206W / 250W | 11539MiB / 12196MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp On | 00000000:84:00.0 Off | N/A |
| 40% 65C P2 206W / 250W | 8375MiB / 12196MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp On | 00000000:87:00.0 Off | N/A |
| 47% 75C P2 236W / 250W | 11089MiB / 12196MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp On | 00000000:88:00.0 Off | N/A |
| 49% 79C P2 257W / 250W | 8409MiB / 12196MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Expected behavior
It should work with both a single GPU and multiple GPUs, but it feels quite unstable when I use multiple GPUs.
Environment
Please paste the output of python -m detectron2.utils.collect_env.
/home/user/miniconda/envs/detectron2/bin/python: Error while finding module specification for 'detectron2.utils.collect_env.' (ModuleNotFoundError: __path__ attribute not found on 'detectron2.utils.collect_env' while trying to find 'detectron2.utils.collect_env.')
(detectron2) root@b06e1b5c1ffb:/SSD/hyunsu/detectron2# python -m detectron2.utils.collect_env .
--------------------- --------------------------------------------------
Python 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
Detectron2 Compiler GCC 5.4
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.3.0
PyTorch Debug Build False
CUDA available True
GPU 0,1,2,3 TITAN Xp
Pillow 6.2.0
cv2 4.1.1
--------------------- --------------------------------------------------
PyTorch built with:
- GCC 7.3
- Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_50,code=compute_50
- CuDNN 7.6.3
- Magma 2.5.1
- Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
Top GitHub Comments
@ZhouDongliang did you check that the number of classes in your config file is correct? I had the same bug and I fixed it by setting cfg.MODEL.ROI_HEADS.NUM_CLASSES for my Faster R-CNN model to the correct value.

I set this item in the JSON file, but it does not work, no matter whether the number of GPUs is 1 or 2. Who can answer me? When I train, this error always occurs and the training is interrupted, which is frustrating!
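For reference, a minimal sketch of the fix described in the first comment (the config path and class count are reused from this issue; treat the rest as an example, not a verified solution):

```python
from detectron2.config import get_cfg

# NUM_CLASSES must match the number of categories in the registered dataset
# (13 for the "moda" dataset described above).
cfg = get_cfg()
cfg.merge_from_file("configs/modanet.yaml")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 13
```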