Error : Default process group is not initialized
Torch : 1.4.0
CUDA: 10.0
MMCV : 1.0.2
MMSEG: 0.5.0+1c3f547
small custom dataset
Config :
norm_cfg = dict(type='BN', requires_grad=True)
model = dict(
type='CascadeEncoderDecoder',
num_stages=2,
pretrained='open-mmlab://msra/hrnetv2_w18',
backbone=dict(
type='HRNet',
norm_cfg=dict(type='SyncBN', requires_grad=True),
norm_eval=False,
extra=dict(
stage1=dict(
num_modules=1,
num_branches=1,
block='BOTTLENECK',
num_blocks=(4, ),
num_channels=(64, )),
stage2=dict(
num_modules=1,
num_branches=2,
block='BASIC',
num_blocks=(4, 4),
num_channels=(18, 36)),
stage3=dict(
num_modules=4,
num_branches=3,
block='BASIC',
num_blocks=(4, 4, 4),
num_channels=(18, 36, 72)),
stage4=dict(
num_modules=3,
num_branches=4,
block='BASIC',
num_blocks=(4, 4, 4, 4),
num_channels=(18, 36, 72, 144)))),
decode_head=[
dict(
type='FCNHead',
in_channels=[18, 36, 72, 144],
channels=270,
in_index=(0, 1, 2, 3),
input_transform='resize_concat',
kernel_size=1,
num_convs=1,
concat_input=False,
dropout_ratio=-1,
num_classes=8,
norm_cfg=dict(type='SyncBN', requires_grad=True),
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
dict(
type='OCRHead',
in_channels=[18, 36, 72, 144],
in_index=(0, 1, 2, 3),
input_transform='resize_concat',
channels=512,
ocr_channels=256,
dropout_ratio=-1,
num_classes=8,
norm_cfg=dict(type='SyncBN', requires_grad=True),
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0))
])
train_cfg = dict()
test_cfg = dict(mode='whole')
dataset_type = 'Aircraft'
data_root = '/mmdetection_aircraft/data/segm/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 1024)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations'),
dict(type='Resize', img_scale=(1024, 768), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(512, 384), cat_max_ratio=0.75),
dict(type='RandomFlip', flip_ratio=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size=(512, 384), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1024, 768),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=5,
workers_per_gpu=2,
train=dict(
type='Aircraft',
data_root='/mmdetection_aircraft/data/segm/',
img_dir='JPEGImages',
ann_dir='SegmentationClass',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations'),
dict(type='Resize', img_scale=(1024, 768), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(512, 384), cat_max_ratio=0.75),
dict(type='RandomFlip', flip_ratio=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size=(512, 384), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
],
split='train.txt'),
val=dict(
type='Aircraft',
data_root='/mmdetection_aircraft/data/segm/',
img_dir='JPEGImages',
ann_dir='SegmentationClass',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1024, 768),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
],
split='val.txt'),
test=dict(
type='Aircraft',
data_root='/mmdetection_aircraft/data/segm/',
img_dir='JPEGImages',
ann_dir='SegmentationClass',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1024, 768),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
],
split='val.txt'))
log_config = dict(
interval=1, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = 'checkpoints/ocrnet_hr18_512x1024_40k_cityscapes_20200601_033320-401c5bdd.pth'
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
total_iters = 3
checkpoint_config = dict(by_epoch=False, interval=3)
evaluation = dict(interval=3, metric='mIoU')
work_dir = './work_dirs/tutorial'
seed = 0
gpu_ids = [0]
TRAIN MODEL :
model = build_segmentor(
    cfg.model, train_cfg=cfg.train_cfg, test_cfg=cfg.test_cfg)
model.CLASSES = datasets[0].CLASSES
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
train_segmentor(model, datasets, cfg, distributed=False, validate=True,
                meta=dict())
#FULL error description:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-16-fec2661e1f4c> in <module>
16 mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
17 train_segmentor(model, datasets, cfg, distributed=False, validate=True,
---> 18 meta=dict())
~/mmsegmentation/mmseg/apis/train.py in train_segmentor(model, dataset, cfg, distributed, validate, timestamp, meta)
104 elif cfg.load_from:
105 runner.load_checkpoint(cfg.load_from)
--> 106 runner.run(data_loaders, cfg.workflow, cfg.total_iters)
~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py in run(self, data_loaders, workflow, max_iters, **kwargs)
117 if mode == 'train' and self.iter >= max_iters:
118 return
--> 119 iter_runner(iter_loaders[i], **kwargs)
120
121 time.sleep(1) # wait for some hooks like loggers to finish
~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py in train(self, data_loader, **kwargs)
53 self.call_hook('before_train_iter')
54 data_batch = next(data_loader)
---> 55 outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
56 if not isinstance(outputs, dict):
57 raise TypeError('model.train_step() must return a dict')
~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py in train_step(self, *inputs, **kwargs)
29
30 inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
---> 31 return self.module.train_step(*inputs[0], **kwargs[0])
32
33 def val_step(self, *inputs, **kwargs):
~/mmsegmentation/mmseg/models/segmentors/base.py in train_step(self, data_batch, optimizer, **kwargs)
147 averaging the logs.
148 """
--> 149 losses = self.forward_train(**data_batch, **kwargs)
150 loss, log_vars = self._parse_losses(losses)
151
~/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py in forward_train(self, img, img_metas, gt_semantic_seg)
150 """
151
--> 152 x = self.extract_feat(img)
153
154 losses = dict()
~/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py in extract_feat(self, img)
76 def extract_feat(self, img):
77 """Extract features from images."""
---> 78 x = self.backbone(img)
79 if self.with_neck:
80 x = self.neck(x)
~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
~/mmsegmentation/mmseg/models/backbones/hrnet.py in forward(self, x)
512
513 x = self.conv1(x)
--> 514 x = self.norm1(x)
515 x = self.relu(x)
516 x = self.conv2(x)
~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py in forward(self, input)
456 if self.process_group:
457 process_group = self.process_group
--> 458 world_size = torch.distributed.get_world_size(process_group)
459 need_sync = world_size > 1
460
~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in get_world_size(group)
584 return -1
585
--> 586 return _get_group_size(group)
587
588
~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _get_group_size(group)
200 """
201 if group is GroupMember.WORLD:
--> 202 _check_default_pg()
203 return _default_pg.size()
204 if group not in _pg_group_ranks:
~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _check_default_pg()
191 """
192 assert _default_pg is not None, \
--> 193 "Default process group is not initialized"
194
195
AssertionError: Default process group is not initialized
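Change 'SyncBN' to 'BN' in the configs under configs/_base_. SyncBN needs an initialized process group, which does not exist when training with distributed=False.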
Yep, that helps. But it is strange that we have to change the norm_cfg parameter for each head separately, as well as in the backbone.
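Since norm_cfg is repeated in the backbone and in every decode head, one way to avoid editing each entry by hand is to walk the config and swap the type everywhere before building the model. The following is only a sketch: replace_norm_type is a hypothetical helper (not part of mmcv or mmsegmentation), and it assumes the model config behaves like nested Python dicts and lists, as cfg.model from the config above appears to.

```python
def replace_norm_type(cfg, old='SyncBN', new='BN'):
    """Recursively swap norm layer types in a nested dict/list config, in place.

    Hypothetical helper for illustration; not an mmcv/mmseg API.
    """
    if isinstance(cfg, dict):
        # A norm config looks like dict(type='SyncBN', requires_grad=True).
        if cfg.get('type') == old:
            cfg['type'] = new
        for value in cfg.values():
            replace_norm_type(value, old, new)
    elif isinstance(cfg, (list, tuple)):
        # decode_head is a list of head configs in cascade models.
        for item in cfg:
            replace_norm_type(item, old, new)
    return cfg


# Trimmed-down stand-in for the model config posted in this issue.
model = dict(
    type='CascadeEncoderDecoder',
    backbone=dict(
        type='HRNet',
        norm_cfg=dict(type='SyncBN', requires_grad=True)),
    decode_head=[
        dict(type='FCNHead',
             norm_cfg=dict(type='SyncBN', requires_grad=True)),
        dict(type='OCRHead',
             norm_cfg=dict(type='SyncBN', requires_grad=True)),
    ])

replace_norm_type(model)
print(model['backbone']['norm_cfg'])
print([head['norm_cfg']['type'] for head in model['decode_head']])
```

With a loaded config this would presumably be called as replace_norm_type(cfg.model) before build_segmentor, so that no SyncBN layer is created when training with distributed=False.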