Wandb logging bug using Iteration based runner
Describe the Issue
Validation metrics reporting/logging to wandb does not happen when using IterBasedRunner
Reproduction
Here’s a simple reproduction of the bug.
Using mmdetection, take the config file faster_rcnn/faster_rcnn_r50_caffe_c4_1x_coco.py and apply the following edits:
max_iters = 100
runner = dict(
    _delete_=True,
    type='IterBasedRunner',
    max_iters=max_iters
)
lr_config = dict(
    policy='step',
    gamma=0.1,
    by_epoch=False,
    warmup='linear',
    warmup_by_epoch=False,
    warmup_ratio=1.0,  # no warmup
    warmup_iters=10
)
interval = 10
workflow = [('train', interval)]
checkpoint_config = dict(
    by_epoch=False, interval=interval)
evaluation = dict(
    interval=interval,
    metric=['bbox'])
log_config = dict(
    interval=5,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(
            type='WandbLoggerHook',
            init_kwargs=dict(
                project='train-tests',
                name='short'
            ),
            out_suffix=('.log.json', '.log', '.py'),
            by_epoch=False,
        ),
    ]
)
Environment
- Python 3.8.8
- mmcv 1.4.5
- mmdet 2.25.0
- wandb 0.12.0
Error traceback
2022-06-22 13:55:37,523 - mmdet - INFO - Iter [5/100] lr: 2.000e-02, eta: 0:01:18, time: 0.831, data_time: 0.035, memory: 7614, loss_rpn_cls: 0.1022, loss_rpn_bbox: 0.2308, loss_cls: 0.1789, acc: 94.2188, loss_bbox: 0.1995, loss: 0.7115
2022-06-22 13:55:41,200 - mmdet - INFO - Saving checkpoint at 10 iterations
2022-06-22 13:55:43,173 - mmdet - INFO - Iter [10/100] lr: 2.000e-03, eta: 0:01:32, time: 1.234, data_time: 0.111, memory: 7653, loss_rpn_cls: 0.1946, loss_rpn_bbox: 0.2274, loss_cls: 0.2475, acc: 92.6758, loss_bbox: 0.2635, loss: 0.9330
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 100/100, 5.3 task/s, elapsed: 19s, ETA: 0s
2022-06-22 13:56:02,333 - mmdet - INFO - Evaluating bbox...
Loading and preparing results...
DONE (t=0.00s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.73s).
Accumulating evaluation results...
DONE (t=0.50s).
2022-06-22 13:56:03,582 - mmdet - INFO -
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.275
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.429
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.295
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.106
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.302
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.448
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.362
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.362
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.362
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.162
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.382
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.535
2022-06-22 13:56:03,590 - mmdet - INFO - Iter(val) [100] bbox_mAP: 0.2750, bbox_mAP_50: 0.4290, bbox_mAP_75: 0.2950, bbox_mAP_s: 0.1060, bbox_mAP_m: 0.3020, bbox_mAP_l: 0.4480, bbox_mAP_copypaste: 0.275 0.429 0.295 0.106 0.302 0.448
wandb: WARNING Step must only increase in log calls. Step 10 < 11; dropping {'val/bbox_mAP': 0.275, 'val/bbox_mAP_50': 0.429, 'val/bbox_mAP_75': 0.295, 'val/bbox_mAP_s': 0.106, 'val/bbox_mAP_m': 0.302, 'val/bbox_mAP_l': 0.448, 'learning_rate': 0.002, 'momentum': 0.9}.
2022-06-22 13:56:07,778 - mmdet - INFO - Iter [15/100] lr: 2.000e-04, eta: 0:03:14, time: 4.811, data_time: 4.090, memory: 7653, loss_rpn_cls: 0.1167, loss_rpn_bbox: 0.2035, loss_cls: 0.2403, acc: 94.2578, loss_bbox: 0.1729, loss: 0.7333
2022-06-22 13:56:11,395 - mmdet - INFO - Saving checkpoint at 20 iterations
2022-06-22 13:56:13,374 - mmdet - INFO - Iter [20/100] lr: 2.000e-04, eta: 0:02:42, time: 1.229, data_time: 0.117, memory: 7653, loss_rpn_cls: 0.0972, loss_rpn_bbox: 0.1758, loss_cls: 0.2176, acc: 92.2461, loss_bbox: 0.2259, loss: 0.7165
Bug fix
In the wandb hook's log method, self.wandb.log is always called with commit=True by default. Therefore, the log call from the last training step (before validation) causes wandb to increment its step by one. Then, when wandb.log is called for the validation metrics, wandb's step is already ahead of the current step (at validation) by one, so the validation metrics are dropped.
Is there a good way to commit only after each validation is done?
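The step mismatch described above can be illustrated with a toy model of the step bookkeeping. This is only a sketch: ToyRun is a hypothetical stand-in written for this example, not the real wandb API, and it assumes (as the warning in the log suggests) that a row whose step is lower than the internal counter is dropped, and that a committed log call advances the counter by one.

```python
class ToyRun:
    """Simplified, hypothetical stand-in for wandb's step bookkeeping."""

    def __init__(self):
        self.step = 0       # step that the next committed row belongs to
        self.pending = {}   # metrics accumulated for the current step
        self.history = []   # committed (step, metrics) rows
        self.dropped = []   # rows rejected because their step was stale

    def log(self, data, step=None, commit=True):
        if step is not None:
            if step < self.step:
                # Mirrors "Step must only increase in log calls".
                self.dropped.append((step, data))
                return False
            if step > self.step:
                self._flush()
                self.step = step
        self.pending.update(data)
        if commit:
            # Committing closes the current step and moves to the next one.
            self._flush()
            self.step += 1
        return True

    def _flush(self):
        if self.pending:
            self.history.append((self.step, dict(self.pending)))
            self.pending = {}


# Behaviour reported in the issue: the training log at step 10 is committed,
# so the validation log at the same step arrives one step too late.
buggy = ToyRun()
buggy.log({'train/loss': 0.9330}, step=10, commit=True)
buggy.log({'val/bbox_mAP': 0.275}, step=10, commit=True)

# Deferring the commit until the validation metrics are logged keeps both
# sets of metrics in the same step.
fixed = ToyRun()
fixed.log({'train/loss': 0.9330}, step=10, commit=False)
fixed.log({'val/bbox_mAP': 0.275}, step=10, commit=True)

print('dropped (commit every call):', buggy.dropped)
print('history (commit after val): ', fixed.history)
```

In the first run the validation row ends up in dropped; in the second, both metrics land in a single row for step 10. How to decide inside the hook that a given training log will be followed by a validation log is the open question raised above.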
Hi @levan92, as a workaround, you can set with_step to False. More discussion about the argument can be found at #913.
https://github.com/open-mmlab/mmcv/blob/1f2500102834a01b86bf9ae4db227cd8d724fa6e/mmcv/runner/hooks/logger/wandb.py#L35
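A sketch of how this workaround might look in the log_config from the reproduction. Only the with_step=False line is new; everything else is copied from the config in the issue. As far as I can tell from the linked hook source, the hook then stops passing an explicit step to wandb.log, so wandb's own step counter will no longer match the training iteration; treat that as an assumption to verify against your mmcv version.

```python
log_config = dict(
    interval=5,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(
            type='WandbLoggerHook',
            init_kwargs=dict(
                project='train-tests',
                name='short'
            ),
            out_suffix=('.log.json', '.log', '.py'),
            by_epoch=False,
            # Workaround: do not pass an explicit step to wandb.log.
            with_step=False,
        ),
    ]
)
```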
Thanks for checking it out @levan92. I will investigate more in this direction and make a PR to fix it.