Using IterBasedRunner rather than EpochBasedRunner
By default, MMDet uses an epoch-based runner. I tried changing the configuration to use an iteration-based runner, because the iteration-based runner seems faster to me: for example, when training HRNet in MMSeg I was able to complete 60,000 iterations in 9 hours, whereas when training a model such as FCOS on COCO with the epoch-based runner, a single epoch seems to take almost 2 days. What changes should I make to the configuration to support an iteration-based runner?
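For context, an iteration-based schedule does not reduce the amount of work; it only counts optimizer steps instead of passes over the dataset. A rough conversion, assuming single-GPU training with the batch size from the config below (the numbers are only illustrative):

# Rough epoch-to-iteration conversion (illustrative only; assumes the
# samples_per_gpu=8, single-GPU setup from the config below).
num_images = 118287                      # COCO train2017 training images
samples_per_gpu = 8
num_gpus = 1
iters_per_epoch = num_images // (samples_per_gpu * num_gpus)  # ~14785
total_iters = 12 * iters_per_epoch       # a 12-epoch "1x" schedule is ~177k iterations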
I tried changing the config for FCOS (R50, Caffe, GN head, 1x, COCO) as follows:
_base_ = [
    '../_base_/datasets/coco_detection.py',
    '../_base_/schedules/schedule_1x.py', '../_base_/default_runtime.py'
]
# model settings
model = dict(
    type='FCOS',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=False),
        norm_eval=True,
        style='caffe',
        init_cfg=dict(
            type='Pretrained',
            checkpoint='open-mmlab://detectron/resnet50_caffe')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs='on_output',  # use P5
        num_outs=5,
        relu_before_extra_convs=True),
    bbox_head=dict(
        type='FCOSHead',
        num_classes=80,
        in_channels=256,
        stacked_convs=4,
        feat_channels=256,
        strides=[8, 16, 32, 64, 128],
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox=dict(type='IoULoss', loss_weight=1.0),
        loss_centerness=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0)),
    # training and testing settings
    train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.4,
            min_pos_iou=0,
            ignore_iof_thr=-1),
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(type='nms', iou_threshold=0.5),
        max_per_img=100))
img_norm_cfg = dict(
    mean=[102.9801, 115.9465, 122.7717], std=[1.0, 1.0, 1.0], to_rgb=False)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=8,
    workers_per_gpu=4,
    train=dict(pipeline=train_pipeline),
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline))
# optimizer
optimizer = dict(
    lr=0.01, paramwise_cfg=dict(bias_lr_mult=2., bias_decay_mult=0.))
optimizer_config = dict(
    _delete_=True, grad_clip=dict(max_norm=35, norm_type=2))
Then I modified schedule_1x.py as follows:
# optimizer
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
# learning policy
lr_config = dict(policy='poly', power=0.9, min_lr=1e-4, by_epoch=False)
# runtime settings
runner = dict(type='IterBasedRunner', max_iters=160000)
checkpoint_config = dict(by_epoch=False, interval=16000)
But I ended up receiving a duplicate key error.
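The duplicate key error most likely comes from the edited schedule_1x.py now defining a key, such as checkpoint_config, that default_runtime.py (another file listed in _base_) already defines; MMCV does not allow two base files to share a key. A way around this, sketched below under the assumption that the stock base files are left untouched, is to put the iteration-based settings in the child config itself and mark the dicts that must fully replace their inherited counterparts with _delete_=True (the evaluation line is an extra assumption, not part of the original config):

# Sketch: override the epoch-based schedule from the child config instead of
# editing schedule_1x.py. `_delete_=True` tells the MMCV config loader to drop
# the inherited dict before applying this one.
lr_config = dict(
    _delete_=True,  # drop the inherited step policy entirely
    policy='poly', power=0.9, min_lr=1e-4, by_epoch=False)
runner = dict(
    _delete_=True,  # drop max_epochs inherited from schedule_1x.py
    type='IterBasedRunner', max_iters=160000)
checkpoint_config = dict(by_epoch=False, interval=16000)
evaluation = dict(interval=16000, metric='bbox')  # interval values are examples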
OK, I will look into it. But the config's _base_ relies on schedule_1x.py, which uses max_epochs. When I tried switching the schedule to max_iters, it gave me an error saying the runner can have either max_epochs or max_iters, not both. I will try your suggestion, and if there is an error I will post the stack trace. Thanks.
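For anyone hitting the same error, a quick way to check what the _base_ merge actually produced before launching a long run (a small diagnostic sketch using the standard mmcv Config API; the config path is a placeholder):

from mmcv import Config

# Load the config the same way tools/train.py does and inspect the merge result.
cfg = Config.fromfile('configs/fcos/my_fcos_iter_based.py')  # placeholder path
print(cfg.runner)        # should contain only type and max_iters, not max_epochs as well
# print(cfg.pretty_text) # dumps the fully merged config if more detail is needed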
@BIGWangYuDong I was able to start training using your suggestion, thanks.