Training fcos_r50_caffe_fpn_gn_1x_4gpu only gets 33.6 AP

I can't reproduce the model zoo's 36.9 AP.

The training command is: python tools/train.py own/configs/fcos/own_fcos_r50_caffe_fpn_gn_1x_4gpu.py --gpus 2 --work_dir own/work/fcos/resnet50/coco17

I trained with 2 GPUs. The config is below; I modified norm_cfg and lr.

# model settings
model = dict(
    type='FCOS',
    pretrained='open-mmlab://resnet50_caffe',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='GN', num_groups=32, requires_grad=True),
        style='caffe'),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs=True,
        extra_convs_on_inputs=False,  # use P5
        num_outs=5,
        relu_before_extra_convs=True),
    bbox_head=dict(
        type='FCOSHead',
        num_classes=81,
        in_channels=256,
        stacked_convs=4,
        feat_channels=256,
        strides=[8, 16, 32, 64, 128]))
# training and testing settings
train_cfg = dict(
    assigner=dict(
        type='MaxIoUAssigner',
        pos_iou_thr=0.5,
        neg_iou_thr=0.4,
        min_pos_iou=0,
        ignore_iof_thr=-1),
    smoothl1_beta=0.11,
    gamma=2.0,
    alpha=0.25,
    allowed_border=-1,
    pos_weight=-1,
    debug=False)
test_cfg = dict(
    nms_pre=1000,
    min_bbox_size=0,
    score_thr=0.05,
    nms=dict(type='nms', iou_thr=0.5),
    max_per_img=100)
# dataset settings
dataset_type = 'CocoDataset'
data_root = '/deep3/coco/'
img_norm_cfg = dict(
    mean=[102.9801, 115.9465, 122.7717], std=[1.0, 1.0, 1.0], to_rgb=False)
data = dict(
    imgs_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0.5,
        with_mask=False,
        with_crowd=False,
        with_label=True),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=False,
        with_crowd=False,
        with_label=True),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=False,
        with_crowd=False,
        with_label=False,
        test_mode=True))
# optimizer
optimizer = dict(
    type='SGD',
    lr=0.01/2,  # originally 0.01
    momentum=0.9,
    weight_decay=0.0001,
    paramwise_options=dict(bias_lr_mult=2., bias_decay_mult=0.))
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(
    policy='step',
    warmup='constant',
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
    step=[8, 11])
checkpoint_config = dict(interval=1)
# yapf:disable
log_config = dict(
    interval=500,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])
# yapf:enable
# runtime settings
total_epochs = 12
device_ids = [0,1]
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = './work_dirs/fcos_r50_caffe_fpn_gn_1x_4gpu'
load_from = None
resume_from = None
workflow = [('train', 1)]

The loss trend is:

2019-06-04 17:55:11,177 - INFO - Epoch [1][500/14659]	lr: 0.00167, eta: 1 day, 13:52:21, time: 0.777, data_time: 0.028, memory: 8132, loss_cls: 0.7807, loss_reg: 1.0330, loss_centerness: 0.6560, loss: 2.4698
2019-06-04 21:10:14,956 - INFO - Epoch [2][500/14659]	lr: 0.00500, eta: 1 day, 11:14:13, time: 0.818, data_time: 0.031, memory: 8136, loss_cls: 0.4015, loss_reg: 0.4589, loss_centerness: 0.6103, loss: 1.4707
2019-06-05 00:30:16,880 - INFO - Epoch [3][500/14659]	lr: 0.00500, eta: 1 day, 8:26:15, time: 0.803, data_time: 0.032, memory: 8141, loss_cls: 0.3482, loss_reg: 0.4044, loss_centerness: 0.6042, loss: 1.3569
2019-06-05 03:48:26,608 - INFO - Epoch [4][500/14659]	lr: 0.00500, eta: 1 day, 5:13:03, time: 0.808, data_time: 0.032, memory: 8141, loss_cls: 0.3275, loss_reg: 0.3721, loss_centerness: 0.6024, loss: 1.3020
2019-06-05 07:03:33,090 - INFO - Epoch [5][500/14659]	lr: 0.00500, eta: 1 day, 1:52:26, time: 0.812, data_time: 0.032, memory: 8141, loss_cls: 0.3063, loss_reg: 0.3597, loss_centerness: 0.6011, loss: 1.2671
2019-06-05 10:19:46,424 - INFO - Epoch [6][500/14659]	lr: 0.00500, eta: 22:36:29, time: 0.804, data_time: 0.032, memory: 8173, loss_cls: 0.2965, loss_reg: 0.3442, loss_centerness: 0.5983, loss: 1.2390
2019-06-05 13:37:49,787 - INFO - Epoch [7][500/14659]	lr: 0.00500, eta: 19:22:50, time: 0.827, data_time: 0.032, memory: 8173, loss_cls: 0.2826, loss_reg: 0.3351, loss_centerness: 0.5965, loss: 1.2142
2019-06-05 16:52:37,752 - INFO - Epoch [8][500/14659]	lr: 0.00500, eta: 16:06:22, time: 0.785, data_time: 0.032, memory: 8173, loss_cls: 0.2783, loss_reg: 0.3305, loss_centerness: 0.5962, loss: 1.2050
2019-06-05 20:05:11,231 - INFO - Epoch [9][500/14659]	lr: 0.00050, eta: 12:49:41, time: 0.807, data_time: 0.031, memory: 8174, loss_cls: 0.2473, loss_reg: 0.3042, loss_centerness: 0.5937, loss: 1.1452
2019-06-05 23:24:41,264 - INFO - Epoch [10][500/14659]	lr: 0.00050, eta: 9:36:41, time: 0.809, data_time: 0.032, memory: 8174, loss_cls: 0.2214, loss_reg: 0.2812, loss_centerness: 0.5904, loss: 1.0930
2019-06-06 02:43:40,252 - INFO - Epoch [11][500/14659]	lr: 0.00050, eta: 6:22:42, time: 0.819, data_time: 0.031, memory: 8174, loss_cls: 0.2136, loss_reg: 0.2748, loss_centerness: 0.5894, loss: 1.0778
2019-06-06 06:02:26,207 - INFO - Epoch [12][500/14659]	lr: 0.00005, eta: 3:08:12, time: 0.822, data_time: 0.031, memory: 8174, loss_cls: 0.2079, loss_reg: 0.2638, loss_centerness: 0.5883, loss: 1.0600

By the end of training the loss is about 1.06.
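To compare runs more easily, the per-epoch numbers can be pulled out of log lines like the ones above. This is a minimal sketch that assumes the text log format shown in this post; `LINE_RE` and `parse_log_line` are hypothetical helper names, not part of mmdetection.

```python
import re

# Matches lines such as:
# "... - INFO - Epoch [1][500/14659]<tab>lr: 0.00167, ..., loss: 2.4698"
# The trailing "loss: " group picks the total loss, not loss_cls/loss_reg.
LINE_RE = re.compile(
    r"Epoch \[(?P<epoch>\d+)\]\[(?P<iter>\d+)/\d+\]\s+lr: (?P<lr>[\d.]+),"
    r".*\bloss: (?P<loss>[\d.]+)"
)


def parse_log_line(line):
    """Return (epoch, iteration, lr, total_loss), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return int(m["epoch"]), int(m["iter"]), float(m["lr"]), float(m["loss"])


sample = (
    "2019-06-06 06:02:26,207 - INFO - Epoch [12][500/14659]\tlr: 0.00005, "
    "eta: 3:08:12, time: 0.822, data_time: 0.031, memory: 8174, "
    "loss_cls: 0.2079, loss_reg: 0.2638, loss_centerness: 0.5883, loss: 1.0600"
)
print(parse_log_line(sample))
```

Running this over the whole log file gives one tuple per logged interval, which makes it easier to line up the learning-rate steps against the loss.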

Any insight into the problem?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 18 (1 by maintainers)

Top GitHub Comments

1 reaction
YilanWang commented, Oct 30, 2019

"Yes, I only changed the lr, but this time I can only use one GPU, so the lr is 0.02/4."

When you use 1 GPU, I think the lr should be 0.01/4 rather than 0.02/4. Am I misunderstanding this parameter? Thanks!
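The arithmetic behind this comment is the linear scaling heuristic: scale the learning rate in proportion to the total batch size. The sketch below assumes the reference config was tuned for 4 GPUs with 4 images each (total batch 16) at lr 0.01, which is inferred from the config name and the "originally 0.01" comment in the post. `scaled_lr` is a hypothetical helper, not an mmdetection function.

```python
def scaled_lr(base_lr, base_batch_size, num_gpus, imgs_per_gpu):
    """Scale the learning rate linearly with the total batch size."""
    return base_lr * (num_gpus * imgs_per_gpu) / base_batch_size


BASE_LR = 0.01      # lr in the reference config, per the post
BASE_BATCH = 4 * 4  # assumed: 4 GPUs x imgs_per_gpu=4

for gpus in (4, 2, 1):
    print(gpus, "GPU(s) ->", scaled_lr(BASE_LR, BASE_BATCH, gpus, imgs_per_gpu=4))
```

Under these assumptions, 2 GPUs gives 0.01/2 (the value used in the posted config) and 1 GPU gives 0.01/4. The rule is a starting point rather than a guarantee of matching accuracy.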

1 reaction
yhcao6 commented, Jun 6, 2019

GN is only applied to the FCOS head. Could you try again without replacing BN with GN in any module other than the FCOS head?
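A sketch of what that suggestion might look like in the posted config: only the backbone's norm_cfg is reverted to BatchNorm, and everything else stays as posted. The exact BN settings (frozen, `requires_grad=False`) are an assumption based on how caffe-style ResNet backbones are commonly configured; check them against the model zoo config before training.

```python
# Sketch only: backbone from the posted config with GN swapped back to BN.
backbone = dict(
    type='ResNet',
    depth=50,
    num_stages=4,
    out_indices=(0, 1, 2, 3),
    frozen_stages=1,
    # Assumption: frozen BN, matching the pretrained caffe weights,
    # instead of dict(type='GN', num_groups=32, requires_grad=True).
    norm_cfg=dict(type='BN', requires_grad=False),
    style='caffe')
```

The FCOS head is left untouched here, since the reply indicates that GN belongs there.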
