Wandb logging bug using Iteration based runner

Describe the Issue

Validation metrics are not reported/logged to wandb when using IterBasedRunner.

Reproduction

Here’s a simple reproduction of the bug.

Using mmdetection, take the config file faster_rcnn/faster_rcnn_r50_caffe_c4_1x_coco.py and apply the following edits:

max_iters = 100 
runner = dict(
    _delete_=True, 
    type='IterBasedRunner', 
    max_iters=max_iters
)

lr_config = dict(
    policy='step',
    gamma=0.1,
    by_epoch=False,
    warmup='linear',
    warmup_by_epoch=False,
    warmup_ratio=1.0,  # no warmup
    warmup_iters=10
    )

interval = 10
workflow = [('train', interval)]
checkpoint_config = dict(
    by_epoch=False, interval=interval)

evaluation = dict(
    interval=interval,
    metric=['bbox'])

log_config = dict(
    interval=5,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(
            type='WandbLoggerHook',
            init_kwargs=dict(
                project='train-tests',
                name='short'
                ),
            out_suffix=('.log.json', '.log', '.py'),
            by_epoch=False,
            ),
        ]
    )

Environment

Python 3.8.8, mmcv 1.4.5, mmdet 2.25.0, wandb 0.12.0

Error traceback

No exception is raised. wandb emits a warning and drops the validation metrics (see the "wandb: WARNING" line in the log).

2022-06-22 13:55:37,523 - mmdet - INFO - Iter [5/100]   lr: 2.000e-02, eta: 0:01:18, time: 0.831, data_time: 0.035, memory: 7614, loss_rpn_cls: 0.1022, loss_rpn_bbox: 0.2308, loss_cls: 0.1789, acc: 94.2188, loss_bbox: 0.1995, loss: 0.7115
2022-06-22 13:55:41,200 - mmdet - INFO - Saving checkpoint at 10 iterations
2022-06-22 13:55:43,173 - mmdet - INFO - Iter [10/100]  lr: 2.000e-03, eta: 0:01:32, time: 1.234, data_time: 0.111, memory: 7653, loss_rpn_cls: 0.1946, loss_rpn_bbox: 0.2274, loss_cls: 0.2475, acc: 92.6758, loss_bbox: 0.2635, loss: 0.9330
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 100/100, 5.3 task/s, elapsed: 19s, ETA:     0s

2022-06-22 13:56:02,333 - mmdet - INFO - Evaluating bbox...
Loading and preparing results...
DONE (t=0.00s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.73s).
Accumulating evaluation results...
DONE (t=0.50s).
2022-06-22 13:56:03,582 - mmdet - INFO -
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.275
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.429
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.295
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.106
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.302
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.448
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.362
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.362
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.362
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.162
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.535

2022-06-22 13:56:03,590 - mmdet - INFO - Iter(val) [100]        bbox_mAP: 0.2750, bbox_mAP_50: 0.4290, bbox_mAP_75: 0.2950, bbox_mAP_s: 0.1060, bbox_mAP_m: 0.3020, bbox_mAP_l: 0.4480, bbox_mAP_copypaste: 0.275 0.429 0.295 0.106 0.302 0.448
wandb: WARNING Step must only increase in log calls.  Step 10 < 11; dropping {'val/bbox_mAP': 0.275, 'val/bbox_mAP_50': 0.429, 'val/bbox_mAP_75': 0.295, 'val/bbox_mAP_s': 0.106, 'val/bbox_mAP_m': 0.302, 'val/bbox_mAP_l': 0.448, 'learning_rate': 0.002, 'momentum': 0.9}.
2022-06-22 13:56:07,778 - mmdet - INFO - Iter [15/100]  lr: 2.000e-04, eta: 0:03:14, time: 4.811, data_time: 4.090, memory: 7653, loss_rpn_cls: 0.1167, loss_rpn_bbox: 0.2035, loss_cls: 0.2403, acc: 94.2578, loss_bbox: 0.1729, loss: 0.7333
2022-06-22 13:56:11,395 - mmdet - INFO - Saving checkpoint at 20 iterations
2022-06-22 13:56:13,374 - mmdet - INFO - Iter [20/100]  lr: 2.000e-04, eta: 0:02:42, time: 1.229, data_time: 0.117, memory: 7653, loss_rpn_cls: 0.0972, loss_rpn_bbox: 0.1758, loss_cls: 0.2176, acc: 92.2461, loss_bbox: 0.2259, loss: 0.7165

Bug fix

In the wandb hook's log method, self.wandb.log is always called with commit=True (the default). The log call for the last training step before validation therefore makes wandb increment its internal step by one. When wandb.log is then called for the validation metrics at the same iteration, wandb's step is already one ahead of the step being logged (10 < 11 in the warning), so the validation metrics are dropped.
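
The behaviour described here can be reproduced with a small toy model. This is only a sketch: `StepCounterSketch` and `run` are hypothetical names invented for illustration, and the class mimics the "step must only increase" rule as described in this issue rather than wandb's actual implementation.

```python
class StepCounterSketch:
    """Toy logger whose step may only increase (NOT wandb's real code)."""

    def __init__(self):
        self.step = 0       # step of the row currently being built
        self.row = {}       # metrics accumulated for self.step
        self.history = []   # flushed (step, metrics) rows
        self.dropped = []   # rows rejected because their step was too old

    def _flush(self):
        # Move the pending row, if any, into the history.
        if self.row:
            self.history.append((self.step, dict(self.row)))
            self.row = {}

    def log(self, data, step, commit=True):
        if step < self.step:
            # Mirrors "Step must only increase in log calls ... dropping".
            self.dropped.append((step, data))
            return
        if step > self.step:
            self._flush()
            self.step = step
        self.row.update(data)
        if commit:
            # Committing flushes the row and moves the counter forward.
            self._flush()
            self.step += 1


def run(commit):
    logger = StepCounterSketch()
    logger.log({'train/loss': 0.9330}, step=10, commit=commit)   # training log
    logger.log({'val/bbox_mAP': 0.275}, step=10, commit=commit)  # validation log
    logger.log({'train/loss': 0.7333}, step=15, commit=commit)   # next training log
    return logger


buggy = run(commit=True)
deferred = run(commit=False)
print(buggy.dropped)     # validation row rejected: counter is already at 11
print(deferred.history)  # training and validation metrics share step 10
```

In this toy model, deferring the commit lets the validation metrics join the training row for the same step. The trade-off is that the most recent row stays pending until a later step arrives.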

Is there a good way to commit only after each validation is done?
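
One possible workaround, untested here and not confirmed in the issue, is to stop the hook from committing on every call. The fragment below repeats the log_config from the reproduction and adds `commit=False`. It assumes the installed mmcv's WandbLoggerHook accepts a `commit` keyword and forwards it to wandb.log, so check the hook's signature in your version before relying on it.

```python
log_config = dict(
    interval=5,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(
            type='WandbLoggerHook',
            init_kwargs=dict(project='train-tests', name='short'),
            out_suffix=('.log.json', '.log', '.py'),
            by_epoch=False,
            # Assumption: the hook passes this value to wandb.log(commit=...),
            # so a row is only finalized when a later step is logged.
            commit=False,
        ),
    ])
```

With this setting, the training and validation metrics logged at the same iteration should land in the same wandb row instead of the second call being rejected. Whether the final row is flushed when the run ends depends on wandb's own behaviour.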