
Error: Default process group is not initialized

See original GitHub issue

Torch: 1.4.0

CUDA: 10.0

MMCV: 1.0.2

MMSEG: 0.5.0+1c3f547

Dataset: small custom dataset

Config:

norm_cfg = dict(type='BN', requires_grad=True)

model = dict(
    type='CascadeEncoderDecoder',
    num_stages=2,
    pretrained='open-mmlab://msra/hrnetv2_w18',
    backbone=dict(
        type='HRNet',
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        norm_eval=False,
        extra=dict(
            stage1=dict(
                num_modules=1,
                num_branches=1,
                block='BOTTLENECK',
                num_blocks=(4, ),
                num_channels=(64, )),
            stage2=dict(
                num_modules=1,
                num_branches=2,
                block='BASIC',
                num_blocks=(4, 4),
                num_channels=(18, 36)),
            stage3=dict(
                num_modules=4,
                num_branches=3,
                block='BASIC',
                num_blocks=(4, 4, 4),
                num_channels=(18, 36, 72)),
            stage4=dict(
                num_modules=3,
                num_branches=4,
                block='BASIC',
                num_blocks=(4, 4, 4, 4),
                num_channels=(18, 36, 72, 144)))),
    decode_head=[
        dict(
            type='FCNHead',
            in_channels=[18, 36, 72, 144],
            channels=270,
            in_index=(0, 1, 2, 3),
            input_transform='resize_concat',
            kernel_size=1,
            num_convs=1,
            concat_input=False,
            dropout_ratio=-1,
            num_classes=8,
            norm_cfg=dict(type='SyncBN', requires_grad=True),
            align_corners=False,
            loss_decode=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
        dict(
            type='OCRHead',
            in_channels=[18, 36, 72, 144],
            in_index=(0, 1, 2, 3),
            input_transform='resize_concat',
            channels=512,
            ocr_channels=256,
            dropout_ratio=-1,
            num_classes=8,
            norm_cfg=dict(type='SyncBN', requires_grad=True),
            align_corners=False,
            loss_decode=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0))
    ])
train_cfg = dict()
test_cfg = dict(mode='whole')
dataset_type = 'Aircraft'
data_root = '/mmdetection_aircraft/data/segm/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 1024)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(1024, 768), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=(512, 384), cat_max_ratio=0.75),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size=(512, 384), pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1024, 768),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=5,
    workers_per_gpu=2,
    train=dict(
        type='Aircraft',
        data_root='/mmdetection_aircraft/data/segm/',
        img_dir='JPEGImages',
        ann_dir='SegmentationClass',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='Resize', img_scale=(1024, 768), ratio_range=(0.5, 2.0)),
            dict(type='RandomCrop', crop_size=(512, 384), cat_max_ratio=0.75),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size=(512, 384), pad_val=0, seg_pad_val=255),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_semantic_seg'])
        ],
        split='train.txt'),
    val=dict(
        type='Aircraft',
        data_root='/mmdetection_aircraft/data/segm/',
        img_dir='JPEGImages',
        ann_dir='SegmentationClass',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1024, 768),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        split='val.txt'),
    test=dict(
        type='Aircraft',
        data_root='/mmdetection_aircraft/data/segm/',
        img_dir='JPEGImages',
        ann_dir='SegmentationClass',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1024, 768),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        split='val.txt'))
log_config = dict(
    interval=1, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = 'checkpoints/ocrnet_hr18_512x1024_40k_cityscapes_20200601_033320-401c5bdd.pth'
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
total_iters = 3
checkpoint_config = dict(by_epoch=False, interval=3)
evaluation = dict(interval=3, metric='mIoU')
work_dir = './work_dirs/tutorial'
seed = 0
gpu_ids = [0]

Train model:

import os.path as osp

import mmcv
from mmseg.apis import train_segmentor
from mmseg.models import build_segmentor

# cfg and datasets are built in earlier cells; the dumped config is shown above.
model = build_segmentor(
    cfg.model, train_cfg=cfg.train_cfg, test_cfg=cfg.test_cfg)
model.CLASSES = datasets[0].CLASSES
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
train_segmentor(model, datasets, cfg, distributed=False, validate=True,
                meta=dict())

Full error description:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-16-fec2661e1f4c> in <module>
     16 mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
     17 train_segmentor(model, datasets, cfg, distributed=False, validate=True, 
---> 18                 meta=dict())

~/mmsegmentation/mmseg/apis/train.py in train_segmentor(model, dataset, cfg, distributed, validate, timestamp, meta)
    104     elif cfg.load_from:
    105         runner.load_checkpoint(cfg.load_from)
--> 106     runner.run(data_loaders, cfg.workflow, cfg.total_iters)

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py in run(self, data_loaders, workflow, max_iters, **kwargs)
    117                     if mode == 'train' and self.iter >= max_iters:
    118                         return
--> 119                     iter_runner(iter_loaders[i], **kwargs)
    120 
    121         time.sleep(1)  # wait for some hooks like loggers to finish

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py in train(self, data_loader, **kwargs)
     53         self.call_hook('before_train_iter')
     54         data_batch = next(data_loader)
---> 55         outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
     56         if not isinstance(outputs, dict):
     57             raise TypeError('model.train_step() must return a dict')

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py in train_step(self, *inputs, **kwargs)
     29 
     30         inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
---> 31         return self.module.train_step(*inputs[0], **kwargs[0])
     32 
     33     def val_step(self, *inputs, **kwargs):

~/mmsegmentation/mmseg/models/segmentors/base.py in train_step(self, data_batch, optimizer, **kwargs)
    147                 averaging the logs.
    148         """
--> 149         losses = self.forward_train(**data_batch, **kwargs)
    150         loss, log_vars = self._parse_losses(losses)
    151 

~/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py in forward_train(self, img, img_metas, gt_semantic_seg)
    150         """
    151 
--> 152         x = self.extract_feat(img)
    153 
    154         losses = dict()

~/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py in extract_feat(self, img)
     76     def extract_feat(self, img):
     77         """Extract features from images."""
---> 78         x = self.backbone(img)
     79         if self.with_neck:
     80             x = self.neck(x)

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/mmsegmentation/mmseg/models/backbones/hrnet.py in forward(self, x)
    512 
    513         x = self.conv1(x)
--> 514         x = self.norm1(x)
    515         x = self.relu(x)
    516         x = self.conv2(x)

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py in forward(self, input)
    456             if self.process_group:
    457                 process_group = self.process_group
--> 458             world_size = torch.distributed.get_world_size(process_group)
    459             need_sync = world_size > 1
    460 

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in get_world_size(group)
    584         return -1
    585 
--> 586     return _get_group_size(group)
    587 
    588 

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _get_group_size(group)
    200     """
    201     if group is GroupMember.WORLD:
--> 202         _check_default_pg()
    203         return _default_pg.size()
    204     if group not in _pg_group_ranks:

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _check_default_pg()
    191     """
    192     assert _default_pg is not None, \
--> 193         "Default process group is not initialized"
    194 
    195 

AssertionError: Default process group is not initialized
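
The assertion is raised from SyncBatchNorm.forward, which asks torch.distributed for the world size on every training-mode forward pass; without an initialized process group that call fails. Below is a minimal reproduction outside mmsegmentation (a sketch: it assumes one GPU is available, and newer PyTorch versions may raise a different exception type here).

import torch
import torch.nn as nn

# Single process, one GPU, and no torch.distributed.init_process_group() call
# -- the same situation as training with distributed=False above.
sync_bn = nn.SyncBatchNorm(8).cuda()
sync_bn.train()                      # in eval mode the world-size check is skipped
x = torch.randn(2, 8, 4, 4, device='cuda')

# Raises "Default process group is not initialized" (an AssertionError on torch 1.4),
# because training-mode SyncBatchNorm calls torch.distributed.get_world_size().
out = sync_bn(x)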

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 13

Top GitHub Comments

1 reaction
NingAnMe commented, Mar 18, 2021

Hi @rassabin In your config, norm_cfg in backbone and heads is SyncBN, which requires distributed training.

Can you please specify how to solve this problem? Thanks in advance.

Change "SyncBN" to "BN" in "configs/base".
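
As the quoted reply notes, SyncBN needs a process group, so the alternative to switching to BN is to run an actual distributed job, even on a single GPU. A hedged sketch, assuming the training script is launched with python -m torch.distributed.launch --nproc_per_node=1 so the rank environment variables that init_dist reads are set:

from mmcv.runner import init_dist

# Set up the default process group that SyncBN queries (backend='nccl' comes
# from dist_params in the config above), then train in distributed mode.
init_dist('pytorch', **cfg.dist_params)
train_segmentor(model, datasets, cfg, distributed=True, validate=True,
                meta=dict())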

1 reaction
rassabin commented, Jul 20, 2020

Hi @rassabin In your config, norm_cfg in backbone and heads is SyncBN, which requires distributed training.

Yes, that helps. But it's strange that we have to change the norm_cfg parameter for each head separately, as well as in the backbone.
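
Because the dumped config repeats norm_cfg inside the backbone and inside both decode heads, the change does have to reach every occurrence (or be made once in the base config before it is expanded). A small sketch, assuming cfg is the mmcv Config object loaded from the config shown above:

# Switch every SyncBN occurrence to plain BN for single-GPU,
# non-distributed training.
norm_cfg = dict(type='BN', requires_grad=True)

cfg.norm_cfg = norm_cfg
cfg.model.backbone.norm_cfg = norm_cfg
for head in cfg.model.decode_head:      # FCNHead and OCRHead in this config
    head['norm_cfg'] = norm_cfg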

Read more comments on GitHub >

Top Results From Across the Web

Default process group is not initialized · Issue #131 · mapillary/…
AssertionError: Default process group is not initialized #131 … And I have tried running it on both 1 GPU and 2 GPUs…

Default process group has not been initialized, please make sure to call init_process_group
Hello, I've been trying to move a model from a single GPU to a machine I've rented with four GPUs. I used the…

PyTorch-Lightning/community - Gitter
Hi, does anyone know how to debug the "Default process group is not initialized" error when using dp mode? In torch.utils.data.distributed.DistributedSampler.

Error when using train.checkpoint - Ray
RaySystemError: System error: Default process group has not been initialized, please make sure to call init_process_group. Traceback: …

AssertionError: Default process group is not initialized
The author's fix: if the project contains code for distributed training but you are not actually using distributed training, then…
