The problem occurred because PyTorch moved some reducer code around in #44798, which broke MMDistributedDataParallel.
I have reported the bug to PyTorch and received some feedback:
@uniartisan Looks like this might be an issue with how mmcv is using DDP. MMDistributedDataParallel actually uses the reducer directly and mimics some of the DDP logic. This is really not supported and frameworks shouldn’t be using the reducer directly. The problem occurred as we moved some code around for the reducer in #44798, which broke MMDistributedDataParallel.
If you add this change to mmcv, the issue is resolved (although mmcv shouldn’t be doing this in the first place):
diff --git a/mmcv/parallel/distributed.py b/mmcv/parallel/distributed.py
index 07771a3..60e2f31 100644
--- a/mmcv/parallel/distributed.py
+++ b/mmcv/parallel/distributed.py
@@ -28,6 +28,9 @@ class MMDistributedDataParallel(DistributedDataParallel):
         ``self.module.forward()`` with ``self.module.train_step()``.
         It is compatible with PyTorch 1.1 - 1.5.
         """
+        if self.reducer._rebuild_buckets():
+            print("Reducer buckets have been rebuilt in this iteration.")
+
         if getattr(self, 'require_forward_param_sync', True):
             self._sync_params()
         if self.device_ids:
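Putting the pieces together, here is a minimal sketch of what the patched train_step() looks like with the fix applied. Only the lines shown in the diff context above are taken from mmcv; the dispatch logic below is an approximation and may differ from the actual mmcv source.

from torch.nn.parallel import DistributedDataParallel


class PatchedMMDistributedDataParallel(DistributedDataParallel):

    def train_step(self, *inputs, **kwargs):
        # The fix: let the reducer rebuild its buckets before the
        # iteration, mirroring what the stock
        # DistributedDataParallel.forward() does in PyTorch 1.7.
        # Note that _rebuild_buckets() is a private PyTorch API.
        if self.reducer._rebuild_buckets():
            print('Reducer buckets have been rebuilt in this iteration.')

        # The rest reproduces the diff context; the device dispatch is
        # simplified here to the single-GPU-per-process case.
        if getattr(self, 'require_forward_param_sync', True):
            self._sync_params()
        if self.device_ids:
            inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
            return self.module.train_step(*inputs[0], **kwargs[0])
        return self.module.train_step(*inputs, **kwargs)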
The following is the error report I get when running mmdetection:
1. conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
2. Install the latest mmcv and mmdetection.
3. ./tools/dist_train.sh configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py 2
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2020-10-29 21:24:04,114 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
GPU 0: P106-100
GPU 1: GeForce GTX 1060 6GB
CUDA_HOME: /usr
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GCC: gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0
PyTorch: 1.7.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.2
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.5
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.8.1
OpenCV: 4.4.0
MMCV: 1.1.6
MMCV Compiler: GCC 10.2
MMCV CUDA Compiler: 11.0
MMDetection: 2.5.0+ac20ffe
------------------------------------------------------------
2020-10-29 21:24:04,456 - mmdet - INFO - Distributed training: True
2020-10-29 21:24:04,794 - mmdet - INFO - Config:
model = dict(
    type='FasterRCNN',
    pretrained='torchvision://resnet50',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch'),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[8],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    roi_head=dict(
        type='StandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=2,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0))))
train_cfg = dict(
    rpn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.7,
            neg_iou_thr=0.3,
            min_pos_iou=0.3,
            match_low_quality=True,
            ignore_iof_thr=-1),
        sampler=dict(
            type='RandomSampler',
            num=256,
            pos_fraction=0.5,
            neg_pos_ub=-1,
            add_gt_as_proposals=False),
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    rpn_proposal=dict(
        nms_across_levels=False,
        nms_pre=2000,
        nms_post=1000,
        max_num=1000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.5,
            match_low_quality=False,
            ignore_iof_thr=-1),
        sampler=dict(
            type='RandomSampler',
            num=512,
            pos_fraction=0.25,
            neg_pos_ub=-1,
            add_gt_as_proposals=True),
        pos_weight=-1,
        debug=False))
test_cfg = dict(
    rpn=dict(
        nms_across_levels=False,
        nms_pre=1000,
        nms_post=1000,
        max_num=1000,
        nms_thr=0.7,
        min_bbox_size=0),
    rcnn=dict(
        score_thr=0.05,
        nms=dict(type='nms', iou_threshold=0.5),
        max_per_img=100))
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_train2017.json',
        img_prefix='data/coco/train2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_val2017.json',
        img_prefix='data/coco/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CocoDataset',
        ann_file='data/coco/annotations/instances_val2017.json',
        img_prefix='data/coco/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=1, metric='bbox')
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])
total_epochs = 12
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
work_dir = './work_dirs/faster_rcnn_r50_fpn_1x_coco'
gpu_ids = range(0, 1)
2020-10-29 21:24:05,218 - mmdet - INFO - load model from: torchvision://resnet50
2020-10-29 21:24:05,519 - mmdet - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
loading annotations into memory...
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Done (t=0.00s)
creating index...
index created!
2020-10-29 21:24:06,086 - mmdet - INFO - Start running, host: lizhiyuan@lizhiyuan-Server, work_dir: /mnt/Data/workspace/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_coco
2020-10-29 21:24:06,086 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2020-10-29 21:24:49,980 - mmdet - INFO - Epoch [1][50/146] lr: 1.978e-03, eta: 0:24:49, time: 0.875, data_time: 0.125, memory: 3324, loss_rpn_cls: 0.3634, loss_rpn_bbox: 0.0203, loss_cls: 0.2062, acc: 95.9912, loss_bbox: 0.0038, loss: 0.5936
2020-10-29 21:25:28,628 - mmdet - INFO - Epoch [1][100/146] lr: 3.976e-03, eta: 0:22:41, time: 0.773, data_time: 0.008, memory: 3325, loss_rpn_cls: 0.1206, loss_rpn_bbox: 0.0171, loss_cls: 0.0500, acc: 99.2227, loss_bbox: 0.0144, loss: 0.2021
2020-10-29 21:26:04,092 - mmdet - INFO - Saving checkpoint at 1 epochs
[ ] 0/80, elapsed: 0s, ETA:
Traceback (most recent call last):
  File "./tools/train.py", line 178, in <module>
    main()
  File "./tools/train.py", line 167, in main
    train_detector(
  File "/mnt/Data/workspace/mmdetection/mmdet/apis/train.py", line 150, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/mnt/Data/workspace/mmdetection/mmdet/core/evaluation/eval_hooks.py", line 125, in after_train_epoch
    results = multi_gpu_test(
  File "/mnt/Data/workspace/mmdetection/mmdet/apis/test.py", line 97, in multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 606, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: replicas_[0].size() == rebuilt_param_indices_.size() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1603729062494/work/torch/csrc/distributed/c10d/reducer.cpp":1326, please report a bug to PyTorch. rebuilt parameter indices size is not same as original model parameters size.156 versus 22776
Traceback (most recent call last):
  File "./tools/train.py", line 178, in <module>
    main()
  File "./tools/train.py", line 167, in main
    train_detector(
  File "/mnt/Data/workspace/mmdetection/mmdet/apis/train.py", line 150, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/mnt/Data/workspace/mmdetection/mmdet/core/evaluation/eval_hooks.py", line 125, in after_train_epoch
    results = multi_gpu_test(
  File "/mnt/Data/workspace/mmdetection/mmdet/apis/test.py", line 97, in multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 606, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: replicas_[0].size() == rebuilt_param_indices_.size() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1603729062494/work/torch/csrc/distributed/c10d/reducer.cpp":1326, please report a bug to PyTorch. rebuilt parameter indices size is not same as original model parameters size.156 versus 22776
Traceback (most recent call last):
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/lizhiyuan/anaconda3/envs/mmcv/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/lizhiyuan/anaconda3/envs/mmcv/bin/python', '-u', './tools/train.py', '--local_rank=1', 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
Hi @uniartisan, thanks for your bug report. We will fix MMDDP to catch up with PyTorch 1.7. As a workaround, we suggest using PyTorch 1.6 for now.
I’m wondering if this could be resolved by wrapping the module at training time in a wrapper module whose forward() calls module.train_step() instead; see the sketch below. Using the internal APIs is fragile, and a future refactor on our end might break this again.
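For illustration, a rough sketch of that idea (hypothetical code, not part of mmcv or PyTorch): a thin wrapper whose forward() delegates to train_step(), so a stock DistributedDataParallel drives training through its normal forward path and the reducer bookkeeping stays on the supported code path.

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


class TrainStepWrapper(nn.Module):
    """Hypothetical wrapper: makes ``forward()`` call ``train_step()``."""

    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, *args, **kwargs):
        # DDP only hooks forward(), so routing train_step() through it
        # keeps bucket rebuilding and gradient sync on the public path.
        return self.module.train_step(*args, **kwargs)


# Usage sketch (assuming one GPU per process and a model exposing
# train_step(), as mmdet detectors do):
#   model = TrainStepWrapper(model)
#   ddp = DistributedDataParallel(model, device_ids=[local_rank])
#   losses = ddp(data_batch)  # internally calls model.train_step()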