
Error: Default process group is not initialized

See original GitHub issue

Torch: 1.4.0

CUDA: 10.0

MMCV: 1.0.2

MMSEG: 0.5.0+1c3f547

Dataset: small custom dataset

Config:

norm_cfg = dict(type='BN', requires_grad=True)

model = dict(
    type='CascadeEncoderDecoder',
    num_stages=2,
    pretrained='open-mmlab://msra/hrnetv2_w18',
    backbone=dict(
        type='HRNet',
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        norm_eval=False,
        extra=dict(
            stage1=dict(
                num_modules=1,
                num_branches=1,
                block='BOTTLENECK',
                num_blocks=(4, ),
                num_channels=(64, )),
            stage2=dict(
                num_modules=1,
                num_branches=2,
                block='BASIC',
                num_blocks=(4, 4),
                num_channels=(18, 36)),
            stage3=dict(
                num_modules=4,
                num_branches=3,
                block='BASIC',
                num_blocks=(4, 4, 4),
                num_channels=(18, 36, 72)),
            stage4=dict(
                num_modules=3,
                num_branches=4,
                block='BASIC',
                num_blocks=(4, 4, 4, 4),
                num_channels=(18, 36, 72, 144)))),
    decode_head=[
        dict(
            type='FCNHead',
            in_channels=[18, 36, 72, 144],
            channels=270,
            in_index=(0, 1, 2, 3),
            input_transform='resize_concat',
            kernel_size=1,
            num_convs=1,
            concat_input=False,
            dropout_ratio=-1,
            num_classes=8,
            norm_cfg=dict(type='SyncBN', requires_grad=True),
            align_corners=False,
            loss_decode=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
        dict(
            type='OCRHead',
            in_channels=[18, 36, 72, 144],
            in_index=(0, 1, 2, 3),
            input_transform='resize_concat',
            channels=512,
            ocr_channels=256,
            dropout_ratio=-1,
            num_classes=8,
            norm_cfg=dict(type='SyncBN', requires_grad=True),
            align_corners=False,
            loss_decode=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0))
    ])
train_cfg = dict()
test_cfg = dict(mode='whole')
dataset_type = 'Aircraft'
data_root = '/mmdetection_aircraft/data/segm/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 1024)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(1024, 768), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=(512, 384), cat_max_ratio=0.75),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size=(512, 384), pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1024, 768),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=5,
    workers_per_gpu=2,
    train=dict(
        type='Aircraft',
        data_root='/mmdetection_aircraft/data/segm/',
        img_dir='JPEGImages',
        ann_dir='SegmentationClass',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='Resize', img_scale=(1024, 768), ratio_range=(0.5, 2.0)),
            dict(type='RandomCrop', crop_size=(512, 384), cat_max_ratio=0.75),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size=(512, 384), pad_val=0, seg_pad_val=255),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_semantic_seg'])
        ],
        split='train.txt'),
    val=dict(
        type='Aircraft',
        data_root='/mmdetection_aircraft/data/segm/',
        img_dir='JPEGImages',
        ann_dir='SegmentationClass',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1024, 768),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        split='val.txt'),
    test=dict(
        type='Aircraft',
        data_root='/mmdetection_aircraft/data/segm/',
        img_dir='JPEGImages',
        ann_dir='SegmentationClass',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1024, 768),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        split='val.txt'))
log_config = dict(
    interval=1, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = 'checkpoints/ocrnet_hr18_512x1024_40k_cityscapes_20200601_033320-401c5bdd.pth'
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
total_iters = 3
checkpoint_config = dict(by_epoch=False, interval=3)
evaluation = dict(interval=3, metric='mIoU')
work_dir = './work_dirs/tutorial'
seed = 0
gpu_ids = [0]

Train model:

import os.path as osp

import mmcv
from mmseg.apis import train_segmentor
from mmseg.models import build_segmentor

# cfg and datasets are built in earlier cells; the dumped config is shown above.
model = build_segmentor(
    cfg.model, train_cfg=cfg.train_cfg, test_cfg=cfg.test_cfg)
model.CLASSES = datasets[0].CLASSES
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
train_segmentor(model, datasets, cfg, distributed=False, validate=True,
                meta=dict())

Full error description:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-16-fec2661e1f4c> in <module>
     16 mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
     17 train_segmentor(model, datasets, cfg, distributed=False, validate=True, 
---> 18                 meta=dict())

~/mmsegmentation/mmseg/apis/train.py in train_segmentor(model, dataset, cfg, distributed, validate, timestamp, meta)
    104     elif cfg.load_from:
    105         runner.load_checkpoint(cfg.load_from)
--> 106     runner.run(data_loaders, cfg.workflow, cfg.total_iters)

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py in run(self, data_loaders, workflow, max_iters, **kwargs)
    117                     if mode == 'train' and self.iter >= max_iters:
    118                         return
--> 119                     iter_runner(iter_loaders[i], **kwargs)
    120 
    121         time.sleep(1)  # wait for some hooks like loggers to finish

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py in train(self, data_loader, **kwargs)
     53         self.call_hook('before_train_iter')
     54         data_batch = next(data_loader)
---> 55         outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
     56         if not isinstance(outputs, dict):
     57             raise TypeError('model.train_step() must return a dict')

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py in train_step(self, *inputs, **kwargs)
     29 
     30         inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
---> 31         return self.module.train_step(*inputs[0], **kwargs[0])
     32 
     33     def val_step(self, *inputs, **kwargs):

~/mmsegmentation/mmseg/models/segmentors/base.py in train_step(self, data_batch, optimizer, **kwargs)
    147                 averaging the logs.
    148         """
--> 149         losses = self.forward_train(**data_batch, **kwargs)
    150         loss, log_vars = self._parse_losses(losses)
    151 

~/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py in forward_train(self, img, img_metas, gt_semantic_seg)
    150         """
    151 
--> 152         x = self.extract_feat(img)
    153 
    154         losses = dict()

~/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py in extract_feat(self, img)
     76     def extract_feat(self, img):
     77         """Extract features from images."""
---> 78         x = self.backbone(img)
     79         if self.with_neck:
     80             x = self.neck(x)

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/mmsegmentation/mmseg/models/backbones/hrnet.py in forward(self, x)
    512 
    513         x = self.conv1(x)
--> 514         x = self.norm1(x)
    515         x = self.relu(x)
    516         x = self.conv2(x)

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py in forward(self, input)
    456             if self.process_group:
    457                 process_group = self.process_group
--> 458             world_size = torch.distributed.get_world_size(process_group)
    459             need_sync = world_size > 1
    460 

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in get_world_size(group)
    584         return -1
    585 
--> 586     return _get_group_size(group)
    587 
    588 

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _get_group_size(group)
    200     """
    201     if group is GroupMember.WORLD:
--> 202         _check_default_pg()
    203         return _default_pg.size()
    204     if group not in _pg_group_ranks:

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _check_default_pg()
    191     """
    192     assert _default_pg is not None, \
--> 193         "Default process group is not initialized"
    194 
    195 

AssertionError: Default process group is not initialized
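
The assertion is raised from SyncBatchNorm.forward, which asks torch.distributed for the world size on every training-mode forward pass; without an initialized process group that call fails. Below is a minimal reproduction outside mmsegmentation (a sketch: it assumes one GPU is available, and newer PyTorch versions may raise a different exception type here).

import torch
import torch.nn as nn

# Single process, one GPU, and no torch.distributed.init_process_group() call
# -- the same situation as training with distributed=False above.
sync_bn = nn.SyncBatchNorm(8).cuda()
sync_bn.train()                      # in eval mode the world-size check is skipped
x = torch.randn(2, 8, 4, 4, device='cuda')

# Raises "Default process group is not initialized" (an AssertionError on torch 1.4),
# because training-mode SyncBatchNorm calls torch.distributed.get_world_size().
out = sync_bn(x)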

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 13

Top GitHub Comments

1 reaction
NingAnMe commented, Mar 18, 2021

Hi @rassabin In your config, norm_cfg in backbone and heads is SyncBN, which requires distributed training.

Can you please specify how to solve this problem? Thanks in advance.

Change "SyncBN" to "BN" in "configs/base".
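
As the quoted reply notes, SyncBN needs a process group, so the alternative to switching to BN is to run an actual distributed job, even on a single GPU. A hedged sketch, assuming the training script is launched with python -m torch.distributed.launch --nproc_per_node=1 so the rank environment variables that init_dist reads are set:

from mmcv.runner import init_dist

# Set up the default process group that SyncBN queries (backend='nccl' comes
# from dist_params in the config above), then train in distributed mode.
init_dist('pytorch', **cfg.dist_params)
train_segmentor(model, datasets, cfg, distributed=True, validate=True,
                meta=dict())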

1 reaction
rassabin commented, Jul 20, 2020

Hi @rassabin In your config, norm_cfg in backbone and heads is SyncBN, which requires distributed training.

Yes, that helps. But it's strange that we have to change the norm_cfg parameter for each head separately, as well as in the backbone.
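
Because the dumped config repeats norm_cfg inside the backbone and inside both decode heads, the change does have to reach every occurrence (or be made once in the base config before it is expanded). A small sketch, assuming cfg is the mmcv Config object loaded from the config shown above:

# Switch every SyncBN occurrence to plain BN for single-GPU,
# non-distributed training.
norm_cfg = dict(type='BN', requires_grad=True)

cfg.norm_cfg = norm_cfg
cfg.model.backbone.norm_cfg = norm_cfg
for head in cfg.model.decode_head:      # FCNHead and OCRHead in this config
    head['norm_cfg'] = norm_cfg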

Read more comments on GitHub >

Top Results From Across the Web

Default process group is not initialized · Issue #131 · mapillary/…
AssertionError: Default process group is not initialized #131 … And I have tried running it on both 1 GPU and 2 GPUs…

Default process group has not been initialized, please make sure to call init_process_group
Hello, I've been trying to move a model from a single GPU to a machine I've rented with four GPUs. I used the…

PyTorch-Lightning/community - Gitter
Hi, does anyone know how to debug the "Default process group is not initialized" error when using dp mode? In torch.utils.data.distributed.DistributedSampler.

Error when using train.checkpoint - Ray
RaySystemError: System error: Default process group has not been initialized, please make sure to call init_process_group. Traceback: …

AssertionError: Default process group is not initialized
The author's fix: if the project contains code for distributed training but you are not actually using distributed training, then…
