
Wandb logging bug using Iteration based runner

See original GitHub issue

Describe the Issue

Validation metrics are not reported/logged to wandb when using IterBasedRunner.

Reproduction

Here’s a simple reproduction of the bug, using mmdetection.

In the config file faster_rcnn/faster_rcnn_r50_caffe_c4_1x_coco.py, make the following edits:

max_iters = 100 
runner = dict(
    _delete_=True, 
    type='IterBasedRunner', 
    max_iters=max_iters
)

lr_config = dict(
    policy='step',
    gamma=0.1,
    by_epoch=False,
    warmup='linear',
    warmup_by_epoch=False,
    warmup_ratio=1.0,  # no warmup
    warmup_iters=10
    )

interval = 10
workflow = [('train', interval)]
checkpoint_config = dict(
    by_epoch=False, interval=interval)

evaluation = dict(
    interval=interval,
    metric=['bbox'])

log_config = dict(
    interval=5,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(
            type='WandbLoggerHook',
            init_kwargs=dict(
                project='train-tests',
                name='short'
                ),
            out_suffix=('.log.json', '.log', '.py'),
            by_epoch=False,
            ),
        ]
    )


Environment

  • Python 3.8.8
  • mmcv 1.4.5
  • mmdet 2.25.0
  • wandb 0.12.0

Error traceback

2022-06-22 13:55:37,523 - mmdet - INFO - Iter [5/100]   lr: 2.000e-02, eta: 0:01:18, time: 0.831, data_time: 0.035, memory: 7614, loss_rpn_cls: 0.1022, loss_rpn_bbox: 0.2308, loss_cls: 0.1789, acc: 94.2188, loss_bbox: 0.1995, loss: 0.7115
2022-06-22 13:55:41,200 - mmdet - INFO - Saving checkpoint at 10 iterations
2022-06-22 13:55:43,173 - mmdet - INFO - Iter [10/100]  lr: 2.000e-03, eta: 0:01:32, time: 1.234, data_time: 0.111, memory: 7653, loss_rpn_cls: 0.1946, loss_rpn_bbox: 0.2274, loss_cls: 0.2475, acc: 92.6758, loss_bbox: 0.2635, loss: 0.9330
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 100/100, 5.3 task/s, elapsed: 19s, ETA:     0s

2022-06-22 13:56:02,333 - mmdet - INFO - Evaluating bbox...
Loading and preparing results...
DONE (t=0.00s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.73s).
Accumulating evaluation results...
DONE (t=0.50s).
2022-06-22 13:56:03,582 - mmdet - INFO -
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.275
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.429
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.295
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.106
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.302
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.448
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.362
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.362
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.362
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.162
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.535

2022-06-22 13:56:03,590 - mmdet - INFO - Iter(val) [100]        bbox_mAP: 0.2750, bbox_mAP_50: 0.4290, bbox_mAP_75: 0.2950, bbox_mAP_s: 0.1060, bbox_mAP_m: 0.3020, bbox_mAP_l: 0.4480, bbox_mAP_copypaste: 0.275 0.429 0.295 0.106 0.302 0.448
wandb: WARNING Step must only increase in log calls.  Step 10 < 11; dropping {'val/bbox_mAP': 0.275, 'val/bbox_mAP_50': 0.429, 'val/bbox_mAP_75': 0.295, 'val/bbox_mAP_s': 0.106, 'val/bbox_mAP_m': 0.302, 'val/bbox_mAP_l': 0.448, 'learning_rate': 0.002, 'momentum': 0.9}.
2022-06-22 13:56:07,778 - mmdet - INFO - Iter [15/100]  lr: 2.000e-04, eta: 0:03:14, time: 4.811, data_time: 4.090, memory: 7653, loss_rpn_cls: 0.1167, loss_rpn_bbox: 0.2035, loss_cls: 0.2403, acc: 94.2578, loss_bbox: 0.1729, loss: 0.7333
2022-06-22 13:56:11,395 - mmdet - INFO - Saving checkpoint at 20 iterations
2022-06-22 13:56:13,374 - mmdet - INFO - Iter [20/100]  lr: 2.000e-04, eta: 0:02:42, time: 1.229, data_time: 0.117, memory: 7653, loss_rpn_cls: 0.0972, loss_rpn_bbox: 0.1758, loss_cls: 0.2176, acc: 92.2461, loss_bbox: 0.2259, loss: 0.7165

Bug fix

In the wandb hook's log method, self.wandb.log is always called with commit=True (the default). The log call from the last training step before validation therefore makes wandb increment its internal step by one. When wandb.log is then called with the validation metrics, wandb's internal step is already one ahead of the step passed for the validation log, so wandb drops the row.
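The mismatch can be illustrated with a toy model of the step bookkeeping described above (a deliberate simplification, not wandb's actual implementation): each committed log call advances the internal step, so a later call that passes an explicit earlier step gets dropped.

```python
class ToyRun:
    """Simplified model of wandb's monotonic-step bookkeeping (illustrative only)."""

    def __init__(self):
        self.step = 0      # internal step counter
        self.history = {}  # committed rows, keyed by step

    def log(self, data, step=None, commit=True):
        if step is not None and step < self.step:
            # wandb warns "Step must only increase in log calls" and drops the row
            return False
        if step is not None:
            self.step = step
        self.history.setdefault(self.step, {}).update(data)
        if commit:
            self.step += 1  # commit=True advances the internal step
        return True


run = ToyRun()
# Training log at iteration 10 commits, pushing the internal step to 11.
run.log({'loss': 0.93}, step=10, commit=True)
# Validation metrics then try to log at step 10 -> dropped.
accepted = run.log({'val/bbox_mAP': 0.275}, step=10)
print(accepted)  # False: step 10 < 11, mirroring the warning in the traceback
```

This is why the warning in the traceback reads "Step 10 < 11": the training-step log at iteration 10 committed and bumped the counter before the validation metrics arrived for the same iteration.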

Is there a good way to commit only after each validation is done?

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 12 (1 by maintainers)

Top GitHub Comments

2 reactions
zhouzaida commented, Jun 25, 2022

Hi @levan92, as a workaround you can set with_step to False. More discussion of the argument can be found in #913:

https://github.com/open-mmlab/mmcv/blob/1f2500102834a01b86bf9ae4db227cd8d724fa6e/mmcv/runner/hooks/logger/wandb.py#L35
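Applied to the reproduction config above, the workaround would look something like the sketch below (untested; it assumes the installed mmcv version supports the with_step argument on WandbLoggerHook, per the linked source):

```python
# Same log_config as in the reproduction, with the suggested workaround added.
log_config = dict(
    interval=5,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(
            type='WandbLoggerHook',
            init_kwargs=dict(project='train-tests', name='short'),
            out_suffix=('.log.json', '.log', '.py'),
            by_epoch=False,
            # Workaround: let wandb manage its own step counter instead of
            # passing the runner's iteration, avoiding the step mismatch.
            with_step=False,
        ),
    ])
```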

1 reaction
ayulockin commented, Jun 24, 2022

Thanks for checking it out @levan92. I will investigate more in this direction and make a PR to fix it.
