Multi-GPU training is unstable?
If you do not know the root cause of the problem / bug, and wish someone to help you, please include:
To Reproduce
- what changes you made / what code you wrote
In tools/train_net.py, I added the new dataset at the beginning of the main function:
```python
from detectron2.data.datasets import register_coco_instances

def main(args):
    register_coco_instances("moda", {}, "moda.json", "datasets/moda/images")
```
You can download moda.json here. You can also download a partial set of the moda images here; the full images are here (not recommended due to the large size).
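For reference, here is a quick sanity check on the registered dataset (my own sketch, not part of the original report; it assumes the standard COCO-style XYWH boxes that register_coco_instances produces):

```python
from detectron2.data import DatasetCatalog
from detectron2.data.datasets import register_coco_instances

# Sanity-check sketch: count classes and look for degenerate boxes in the "moda" dataset.
register_coco_instances("moda", {}, "moda.json", "datasets/moda/images")
dataset_dicts = DatasetCatalog.get("moda")

class_ids = set()
bad_boxes = 0
for record in dataset_dicts:
    for ann in record.get("annotations", []):
        class_ids.add(ann["category_id"])
        x, y, w, h = ann["bbox"]  # COCO-registered datasets use XYWH_ABS boxes
        if w <= 0 or h <= 0:
            bad_boxes += 1

print(f"{len(dataset_dicts)} images, {len(class_ids)} classes, {bad_boxes} degenerate boxes")
```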
My configs/modanet.yaml is as follows:
```yaml
_BASE_: "./Base-RCNN-FPN.yaml"
MODEL:
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  # WEIGHTS: "detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl"  # initialize from model zoo
  MASK_ON: True
  RESNETS:
    DEPTH: 50
  ROI_HEADS:
    NUM_CLASSES: 13
DATASETS:
  TRAIN: ("moda",)
  TEST: ()
DATALOADER:
  ASPECT_RATIO_GROUPING: False
  # NUM_WORKERS: 4
SOLVER:
  IMS_PER_BATCH: 20
  BASE_LR: 0.01
  STEPS: (60000, 80000)
  MAX_ITER: 90000
```
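One note on the solver settings (my own observation, not part of the original report): in detectron2, SOLVER.IMS_PER_BATCH is the total number of images per iteration across all GPUs, so with --num-gpus 4 each worker process handles a slice of it:

```python
# IMS_PER_BATCH is the global batch size; detectron2 splits it across worker processes.
ims_per_batch = 20
num_gpus = 4
print(f"~{ims_per_batch // num_gpus} images per GPU per iteration")  # ~5
```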
- what command you run
python tools/train_net.py --num-gpus 4 --config-file configs/modanet.yaml
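For context, this is roughly what the command does under the hood: detectron2's launch() spawns one worker process per GPU and calls main(args) in each. The sketch below is based on the stock tools/train_net.py entry point; treat the details as an approximation of that script, not a verbatim copy.

```python
from detectron2.engine import default_argument_parser, launch


def main(args):
    # In the real script this registers the dataset, builds the trainer and trains;
    # see the snippet at the top of this issue.
    pass


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    launch(
        main,
        args.num_gpus,               # 4 -> four worker processes, one per GPU
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )
```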
- what you observed (full logs are preferred)
When I use a single GPU, it always works fine. But when I try to use multiple GPUs, several bugs occur randomly; only rarely does multi-GPU training work fine. What's wrong with it?
The first bug is about Box2BoxTransform. When I debugged it, the anchor width was less than 0.
[10/14 15:39:39 d2.engine.train_loop]: Starting training from iteration 0
Traceback (most recent call last):
File "modanet.py", line 162, in <module>
args=(args,),
File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 49, in launch
daemon=False,
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.
py", line 171, in spawn
while not spawn_context.join():
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.
py", line 118, in join
raise Exception(msg)
Exception:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.
py", line 19, in _wrap
fn(i, *args)
File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
main_func(*args)
File "/SSD/hyunsu/detectron2/modanet.py", line 146, in main
return trainer.train()
File "/SSD/hyunsu/detectron2/detectron2/engine/defaults.py", line 329, in train
super().train(self.start_iter, self.max_iter)
File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
loss_dict = self.model(data)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py",
line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distribute
d.py", line 442, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py",
line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py",
line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 161, in forward
losses = {k: v * self.loss_weight for k, v in outputs.losses().items()}
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 316, in l
osses
gt_objectness_logits, gt_anchor_deltas = self._get_ground_truth()
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 283, in _
get_ground_truth
anchors_i.tensor, matched_gt_boxes.tensor
File "/SSD/hyunsu/detectron2/detectron2/modeling/box_regression.py", line 63, in get_deltas
assert (src_widths > 0).all().item(), "Input boxes to Box2BoxTransform are not valid!"
AssertionError: Input boxes to Box2BoxTransform are not valid!
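For reference, here is my own simplified illustration of the check that fails here (detectron2's get_deltas asserts that every source box it receives has a positive width):

```python
import torch

# Simplified illustration of the failing check (not detectron2's exact code).
# Boxes are in XYXY format, so width = x2 - x1; a width <= 0 means the box is
# degenerate or the underlying tensor has been corrupted.
boxes = torch.tensor([[30.0, 10.0, 30.0, 50.0],   # x1 == x2 -> zero width (invalid)
                      [10.0, 10.0, 40.0, 50.0]])  # valid box
widths = boxes[:, 2] - boxes[:, 0]
print((widths > 0).tolist())  # [False, True] -- the assertion fires if any entry is False
```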
The second bug is as follows:
Traceback (most recent call last):
File "modanet.py", line 162, in <module>
args=(args,),
File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 49, in launch
daemon=False,
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
main_func(*args)
File "/SSD/hyunsu/detectron2/modanet.py", line 146, in main
return trainer.train()
File "/SSD/hyunsu/detectron2/detectron2/engine/defaults.py", line 329, in train
super().train(self.start_iter, self.max_iter)
File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
loss_dict = self.model(data)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 143, in forward
anchors = self.anchor_generator(features)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/anchor_generator.py", line 181, in forward
anchors_over_all_feature_maps = self.grid_anchors(grid_sizes)
File "/SSD/hyunsu/detectron2/detectron2/modeling/anchor_generator.py", line 124, in grid_anchors
shift_x, shift_y = _create_grid_offsets(size, stride, base_anchors.device)
File "/SSD/hyunsu/detectron2/detectron2/modeling/anchor_generator.py", line 43, in _create_grid_offsets
shifts_x = torch.arange(0, grid_width * stride, step=stride, dtype=torch.float32, device=device)
RuntimeError: tabulate: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
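Note (my addition, not from the original report): cudaErrorIllegalAddress is reported asynchronously, so the line shown above is not necessarily where the bad access actually happened. One way to get a more accurate traceback is to force synchronous kernel launches before anything touches the GPU:

```python
# Force synchronous CUDA kernel launches so errors surface at the real call site.
# This must be set before the CUDA context is created (i.e. before any GPU work).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```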
The third bug is as follows:
[10/14 15:42:59 d2.engine.train_loop]: Starting training from iteration 0
Traceback (most recent call last):
File "modanet.py", line 162, in <module>
args=(args,),
File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 49, in launch
daemon=False,
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
main_func(*args)
File "/SSD/hyunsu/detectron2/modanet.py", line 146, in main
return trainer.train()
File "/SSD/hyunsu/detectron2/detectron2/engine/defaults.py", line 329, in train
super().train(self.start_iter, self.max_iter)
File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
loss_dict = self.model(data)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 161, in forward
losses = {k: v * self.loss_weight for k, v in outputs.losses().items()}
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 316, in losses
gt_objectness_logits, gt_anchor_deltas = self._get_ground_truth()
File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 268, in _get_ground_truth
matched_idxs, gt_objectness_logits_i = self.anchor_matcher(match_quality_matrix)
File "/SSD/hyunsu/detectron2/detectron2/modeling/matcher.py", line 78, in __call__
assert torch.all(match_quality_matrix >= 0)
AssertionError
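My own illustration of what this assertion expects (not from the issue): the match-quality matrix the RPN passes to the matcher is a pairwise IoU matrix, which should always be non-negative for well-formed boxes; negative or NaN entries point to corrupted inputs rather than a normal geometric configuration.

```python
import torch
from detectron2.structures import Boxes, pairwise_iou

# The matcher receives a (num_gt x num_anchors) IoU matrix with entries in [0, 1].
gt_boxes = Boxes(torch.tensor([[10.0, 10.0, 50.0, 50.0]]))
anchors = Boxes(torch.tensor([[0.0, 0.0, 40.0, 40.0],
                              [100.0, 100.0, 120.0, 120.0]]))
match_quality_matrix = pairwise_iou(gt_boxes, anchors)
print(torch.all(match_quality_matrix >= 0).item())  # True for well-formed boxes
```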
When it does work, the memory usage across GPUs is unbalanced, as follows:
(detectron2) root@b06e1b5c1ffb:/SSD/hyunsu/detectron2# nvidia-smi
Sun Oct 13 16:49:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp On | 00000000:83:00.0 Off | N/A |
| 39% 65C P2 206W / 250W | 11539MiB / 12196MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp On | 00000000:84:00.0 Off | N/A |
| 40% 65C P2 206W / 250W | 8375MiB / 12196MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp On | 00000000:87:00.0 Off | N/A |
| 47% 75C P2 236W / 250W | 11089MiB / 12196MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp On | 00000000:88:00.0 Off | N/A |
| 49% 79C P2 257W / 250W | 8409MiB / 12196MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Expected behavior
It should work with both a single GPU and multiple GPUs, but it feels quite unstable when I use multiple GPUs.
Environment
Please paste the output of python -m detectron2.utils.collect_env.
/home/user/miniconda/envs/detectron2/bin/python: Error while finding module specification for 'detectron2.utils.collect_env.' (ModuleNotFoundError: __path__ attribute not found on 'detectron2.utils.collect_env' while trying to find 'detectron2.utils.collect_env.')
(detectron2) root@b06e1b5c1ffb:/SSD/hyunsu/detectron2# python -m detectron2.utils.collect_env .
--------------------- --------------------------------------------------
Python 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
Detectron2 Compiler GCC 5.4
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.3.0
PyTorch Debug Build False
CUDA available True
GPU 0,1,2,3 TITAN Xp
Pillow 6.2.0
cv2 4.1.1
--------------------- --------------------------------------------------
PyTorch built with:
- GCC 7.3
- Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_50,code=compute_50
- CuDNN 7.6.3
- Magma 2.5.1
- Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
Top GitHub Comments
@ZhouDongliang did you check that the number of classes in your config file is correct? I had the same bug and I fixed it by setting cfg.MODEL.ROI_HEADS.NUM_CLASSES for my Faster R-CNN model to the correct value.

I set this item in the JSON file, but it does not work, no matter whether the number of GPUs is 1 or 2. Who can answer me? When I train, this error always occurs and the training is interrupted, which is frustrating!
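For reference, a minimal sketch of the fix described in the first comment (the config path and class count are reused from this issue; treat the rest as an example, not a verified solution):

```python
from detectron2.config import get_cfg

# NUM_CLASSES must match the number of categories in the registered dataset
# (13 for the "moda" dataset described above).
cfg = get_cfg()
cfg.merge_from_file("configs/modanet.yaml")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 13
```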