
Creating custom data for training

See original GitHub issue

Hi, I defined a custom dataset with 6 classes and trained it with DeepLabV3+. The config is shown below.

The custom data structure is as follows:

├─ann_dir (8)
│  ├─train
│  └─val
└─img_dir (24)
    ├─train
    └─val
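
Note that the generic CustomDataset does not know the class names on its own, so the per-class evaluation below depends on them being registered somewhere. A minimal sketch of such a registration, assuming the mmsegmentation v0.x registry API (the class names are taken from the evaluation table further down; the file suffixes and palette colors are assumptions):

# Hypothetical dataset definition; img_suffix, seg_map_suffix and PALETTE
# are assumptions, the class names come from the evaluation table below.
from mmseg.datasets import CustomDataset
from mmseg.datasets.builder import DATASETS

@DATASETS.register_module()
class SIRLabDataset(CustomDataset):
    CLASSES = ('bedrock', 'stone', 'gravel', 'sand', 'soil', 'others')
    PALETTE = [[120, 120, 120], [180, 120, 120], [6, 230, 230],
               [80, 50, 50], [4, 200, 3], [120, 120, 80]]

    def __init__(self, **kwargs):
        super().__init__(img_suffix='.png', seg_map_suffix='.png', **kwargs)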

deeplabv3plus_r50-d8_512x1024_80k_cityscapes_SIR.py is created as follows:

norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained='open-mmlab://resnet50_v1c',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        dilations=(1, 1, 2, 4),
        strides=(1, 2, 1, 1),
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        norm_eval=False,
        style='pytorch',
        contract_dilation=True),
    decode_head=dict(
        type='DepthwiseSeparableASPPHead',
        in_channels=2048,
        in_index=3,
        channels=512,
        dilations=(1, 12, 24, 36),
        c1_in_channels=256,
        c1_channels=48,
        dropout_ratio=0.1,
        num_classes=6,
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=1024,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=6,
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)))
train_cfg = dict()
test_cfg = dict(mode='whole')
dataset_type = 'CustomDataset'
data_root = 'data/SIRLab_mars/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type='CustomDataset',
        data_root='data/SIRLab_mars/',
        img_dir='img_dir/train',
        ann_dir='ann_dir/train',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
            dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_semantic_seg'])
        ]),
    val=dict(
        type='CustomDataset',
        data_root='data/SIRLab_mars/',
        img_dir='img_dir/val',
        ann_dir='ann_dir/val',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 512),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CustomDataset',
        data_root='data/SIRLab_mars/',
        img_dir='img_dir/val',
        ann_dir='ann_dir/val',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 512),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
log_config = dict(
    interval=50, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
total_iters = 80000
checkpoint_config = dict(by_epoch=False, interval=8000)
evaluation = dict(interval=8000, metric='mIoU')
work_dir = './work_dirs/deeplabv3plus_r50-d8_512x1024_80k_cityscapes_SIR'
gpu_ids = range(0, 1)
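
With this config saved, training would typically be launched through mmsegmentation's standard entry point (the exact config path is an assumption about where the file was placed):

# Single-GPU training; adjust the path to wherever the config lives.
python tools/train.py configs/deeplabv3plus/deeplabv3plus_r50-d8_512x1024_80k_cityscapes_SIR.py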

However, I ran into a problem: 5 of the 6 class results are NaN after 80000 iters. The training log is pasted below:

2020-10-09 11:11:05,837 - mmseg - INFO - Loaded 1090 images
2020-10-09 11:11:06,461 - mmseg - INFO - Loaded 123 images
2020-10-09 11:11:06,462 - mmseg - INFO - Start running, work_dir: /mmsegmentation/work_dirs/deeplabv3plus_r50-d8_512x1024_80k_cityscapes_SIR
2020-10-09 11:11:06,462 - mmseg - INFO - workflow: [('train', 1)], max: 80000 iters
2020-10-09 11:11:48,158 - mmseg - INFO - Iter [50/80000]        lr: 9.995e-03, eta: 14:02:00, time: 0.632, data_time: 0.005, memory: 20292, decode.loss_seg: 0.0674, decode.acc_seg: 89.3073, aux.loss_seg: 0.0679, aux.acc_seg: 87.8143, loss: 0.1354
2020-10-09 11:12:11,565 - mmseg - INFO - Iter [100/80000]       lr: 9.989e-03, eta: 12:12:26, time: 0.468, data_time: 0.005, memory: 20292, decode.loss_seg: 0.0000, decode.acc_seg: 92.3593, aux.loss_seg: 0.0004, aux.acc_seg: 92.3593, loss: 0.0004
2020-10-09 11:12:47,120 - mmseg - INFO - Iter [150/80000]       lr: 9.983e-03, eta: 13:23:25, time: 0.711, data_time: 0.005, memory: 20292, decode.loss_seg: 0.0000, decode.acc_seg: 90.7240, aux.loss_seg: 0.0003, aux.acc_seg: 90.7240, loss: 0.0003
...
2020-10-09 23:47:14,564 - mmseg - INFO - Iter [79700/80000]     lr: 1.651e-04, eta: 0:02:49, time: 0.737, data_time: 0.006, memory: 20292, decode.loss_seg: 0.0005, decode.acc_seg: 89.8289, aux.loss_seg: 0.0005, aux.acc_seg: 89.8289, loss: 0.0010
2020-10-09 23:47:38,209 - mmseg - INFO - Iter [79750/80000]     lr: 1.553e-04, eta: 0:02:21, time: 0.473, data_time: 0.006, memory: 20292, decode.loss_seg: 0.0006, decode.acc_seg: 91.7836, aux.loss_seg: 0.0005, aux.acc_seg: 91.7836, loss: 0.0011
2020-10-09 23:48:01,839 - mmseg - INFO - Iter [79800/80000]     lr: 1.453e-04, eta: 0:01:53, time: 0.473, data_time: 0.006, memory: 20292, decode.loss_seg: 0.0006, decode.acc_seg: 91.9399, aux.loss_seg: 0.0005, aux.acc_seg: 91.9399, loss: 0.0012
2020-10-09 23:48:37,906 - mmseg - INFO - Iter [79850/80000]     lr: 1.350e-04, eta: 0:01:24, time: 0.721, data_time: 0.006, memory: 20292, decode.loss_seg: 0.0006, decode.acc_seg: 92.1883, aux.loss_seg: 0.0005, aux.acc_seg: 92.1883, loss: 0.0012
2020-10-09 23:49:01,616 - mmseg - INFO - Iter [79900/80000]     lr: 1.244e-04, eta: 0:00:56, time: 0.474, data_time: 0.006, memory: 20292, decode.loss_seg: 0.0006, decode.acc_seg: 91.7220, aux.loss_seg: 0.0005, aux.acc_seg: 91.7220, loss: 0.0011
2020-10-09 23:49:25,386 - mmseg - INFO - Iter [79950/80000]     lr: 1.132e-04, eta: 0:00:28, time: 0.475, data_time: 0.006, memory: 20292, decode.loss_seg: 0.0006, decode.acc_seg: 92.6604, aux.loss_seg: 0.0006, aux.acc_seg: 92.6604, loss: 0.0012
2020-10-09 23:50:05,197 - mmseg - INFO - Saving checkpoint at 80000 iterations
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 124/123, 16.2 task/s, elapsed: 8s, ETA:     0s

2020-10-09 23:50:39,039 - mmseg - INFO - per class results:
Class                  IoU        Acc
bedrock             100.00     100.00
stone                  nan        nan
gravel                 nan        nan
sand                   nan        nan
soil                   nan        nan
others                 nan        nan
Summary:
Scope                 mIoU       mAcc       aAcc
global              100.00     100.00     100.00

2020-10-09 23:50:39,095 - mmseg - INFO - Exp name: deeplabv3plus_r50-d8_512x1024_80k_cityscapes_SIR.py
2020-10-09 23:50:39,095 - mmseg - INFO - Iter(val) [80000]      mIoU: 1.0000, mAcc: 1.0000, aAcc: 1.0000
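
A table like this, where a single class scores 100.00 and every other class is nan, usually means that only one label value ever appears in the ground-truth masks, so the remaining classes are absent from both the predictions and the annotations. One way to verify is to inspect the unique pixel values of a few masks; a quick sketch, assuming the annotations are single-channel PNGs:

# For a 6-class setup with CrossEntropyLoss, the values printed here
# should be exactly 0..5, plus 255 for any ignored pixels.
import numpy as np
from PIL import Image
from pathlib import Path

for path in sorted(Path('data/SIRLab_mars/ann_dir/val').glob('*.png'))[:5]:
    mask = np.array(Image.open(path))
    print(path.name, np.unique(mask))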

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10

Top GitHub Comments

2 reactions
yehengchen commented, Nov 6, 2020

@yehengchen Could you tell me how you fixed it?

I changed the grayscale value of each category to 0, 1, 2, …
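
That fix lines up with the diagnosis above: mmsegmentation expects each mask pixel to store the class index itself (0..num_classes-1, with 255 reserved as the ignore index), not an arbitrary grayscale level. A hedged sketch of such a remapping, assuming single-channel PNG masks (ORIG_VALUES is a placeholder for the grayscale levels actually used in the annotations):

import numpy as np
from PIL import Image
from pathlib import Path

ORIG_VALUES = [0, 50, 100, 150, 200, 250]  # assumed original grayscale levels

for path in Path('data/SIRLab_mars/ann_dir').rglob('*.png'):
    mask = np.array(Image.open(path))
    remapped = np.full_like(mask, 255)       # 255 = ignore index in mmseg
    for new_idx, old_val in enumerate(ORIG_VALUES):
        remapped[mask == old_val] = new_idx  # map each level to 0..5
    Image.fromarray(remapped).save(path)

After remapping, the per-class table should report numbers for all six classes instead of nan.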

0 reactions
Peter-weng commented, May 26, 2022

@ke-dev Why did I get an email about this question? I don't understand.
