There are still bugs in the scripts used to run validation in train.py
Thanks for your error report and we appreciate it a lot.
Checklist
- I have searched related issues but cannot get the expected help.
- I have read the FAQ documentation but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug
Training scripts do not use the val pipeline when building the validation dataset, which raises the error: forward_train() missing 2 required positional arguments: 'gt_bboxes' and 'gt_labels'.
Reproduction
- What command or script did you run?
from mmdet.datasets import build_dataset
from mmdet.models import build_detector
from mmdet.apis import train_detector
from mmcv import Config
import mmcv
import os.path as osp
cfg = Config.fromfile('<config>')
# Build dataset
datasets = [build_dataset(cfg.data.train), build_dataset(cfg.data.val)]
# Build the detector
model = build_detector(
    cfg.model,
    train_cfg=cfg.get('train_cfg'),
    test_cfg=cfg.get('test_cfg'))
# Add an attribute for visualization convenience
model.CLASSES = datasets[0].CLASSES
# Create work_dir
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
train_detector(model, datasets, cfg, distributed=False, validate=True)
- Did you make any modifications on the code or config? Did you understand what you have modified?
Default mask rcnn settings on a custom dataset that I have used many times with this repo.
- What dataset did you use?
nuimage
Environment
- Please run
python mmdet/utils/collect_env.py
to collect necessary environment information and paste it here.
- You may add additional information that may be helpful for locating the problem, such as
- How you installed PyTorch [e.g., pip, conda, source]: conda
- Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)
Error traceback
2022-03-03 01:51:24,996 - mmdet - INFO - Epoch(val) [1][16445] bbox_mAP: 0.0580, bbox_mAP_50: 0.1260, bbox_mAP_75: 0.0460, bbox_mAP_s: 0.0360, bbox_mAP_m: 0.0670, bbox_mAP_l: 0.0820, bbox_mAP_copypaste: 0.058 0.126 0.046 0.036 0.067 0.082, segm_mAP: 0.0490, segm_mAP_50: 0.1080, segm_mAP_75: 0.0390, segm_mAP_s: 0.0200, segm_mAP_m: 0.0590, segm_mAP_l: 0.0880, segm_mAP_copypaste: 0.049 0.108 0.039 0.020 0.059 0.088
Traceback (most recent call last):
File "/workspace/pycharm_projects/mmdetection/train_debugging.py", line 23, in <module>
train_detector(model, datasets, cfg, distributed=False, validate=True)
File "/workspace/pycharm_projects/mmdetection/mmdet/apis/train.py", line 208, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 67, in val
self.run_iter(data_batch, train_mode=False)
File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 32, in run_iter
outputs = self.model.val_step(data_batch, self.optimizer, **kwargs)
File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 97, in val_step
return self.module.val_step(*inputs[0], **kwargs[0])
File "/workspace/pycharm_projects/mmdetection/mmdet/models/detectors/base.py", line 263, in val_step
losses = self(**data)
File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 109, in new_func
return old_func(*args, **kwargs)
File "/workspace/pycharm_projects/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
TypeError: forward_train() missing 2 required positional arguments: 'gt_bboxes' and 'gt_labels'
Process finished with exit code 1
Bug fix
You need to properly set the validation pipeline in train.py. Currently, line 188 of train.py sets the val dataset pipeline as follows:
val_dataset.pipeline = cfg.data.train.pipeline
This was discussed in #5990, where the user there changed this line to use the val pipeline. However, @hhaAndroid said the following:
@jiangnanwuyanzu The above approach (changing line 188 to val_dataset.pipeline = cfg.data.val.pipeline) is correct because the val pipeline does not have GT. At the same time, we do not recommend using val workflow but evalhook.
It is not clear what is meant by this - the training script does not appear to do this automatically, which is why I believe this is a bug.
Interestingly, this error is thrown AFTER the script successfully produces my validation set classwise results… so it is a very confusing issue. I do not want to change the scripts in your repo, especially since it seems that this is somehow intended behavior… but it seems like line 188 should be changed since the evalhook is causing training to crash after 1 validation round.
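To make the failure mode concrete, here is a minimal stand-alone sketch of the call chain visible in the traceback (val_step calls self(**data), and forward dispatches to forward_train). ToyDetector and both batch dictionaries are hypothetical stand-ins written for illustration; they are not MMDetection code and only mimic the argument handling.

```python
# Hypothetical stand-in that mimics the dispatch seen in the traceback.
# It is NOT the real mmdet BaseDetector; only the argument handling is modelled.
class ToyDetector:
    def forward_train(self, img, img_metas, gt_bboxes, gt_labels, **kwargs):
        # Training-style forward: ground truth is required to compute losses.
        return {"loss": 0.0}

    def forward_test(self, img, img_metas, **kwargs):
        # Test-style forward: no ground truth needed.
        return ["predictions"]

    def forward(self, img, img_metas, return_loss=True, **kwargs):
        if return_loss:
            return self.forward_train(img, img_metas, **kwargs)
        return self.forward_test(img, img_metas, **kwargs)

    def val_step(self, data):
        # Mirrors `losses = self(**data)`: return_loss keeps its default (True).
        return self.forward(**data)


detector = ToyDetector()

# Batch as a train-style pipeline would produce it (ground-truth keys present).
train_style_batch = {"img": "tensor", "img_metas": [{}],
                     "gt_bboxes": [], "gt_labels": []}
# Batch as a test-style pipeline would produce it (no ground-truth keys).
test_style_batch = {"img": "tensor", "img_metas": [{}]}

train_result = detector.val_step(train_style_batch)

try:
    detector.val_step(test_style_batch)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(train_result)
print(error_message)
```

Under these assumptions, the val workflow only works when the validation batches carry the same ground-truth keys as the training batches, which is consistent with the error reported above.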
Issue Analytics
- Created 2 years ago
- Reactions:1
- Comments:13
The following code:
yields the following error:
The correct syntax is not intuitive. What is the issue with this code? It is all from the documentation… I still think this implementation is bugged, it needs to be updated to reflect the eval hooks. Perhaps you can clarify @jbwang1997 @RangiLyu @hhaAndroid
EDIT: the bug is caused by including the val step in the cfg workflow, workflow = [('train', 1), ('val', 1)], as opposed to workflow = [('train', 1)]. I don't understand how the validation workflow/pipeline works or is called, or how we control how often we validate on an epoch basis. Some guidance would be great; the relationship between the eval hooks and these training scripts is not very clear. Thank you for your replies, I'm sure I am just missing something…
Thanks for your suggestion. We will continue to improve related documents. For your question, the val workflow actually has the same logic as the train workflow. The difference between them is that they execute different hook functions. For example, the train workflow calls the before_train_iter and after_train_iter functions while the val workflow calls the before_val_iter and after_val_iter
functions. It is worth noting that return_loss is still True in the val workflow, so the model needs the same input as the train workflow. That's why we need to set val_dataset.pipeline = cfg.data.train.pipeline. You can refer to the details here.
The evaluation function is implemented in EvalHook and called in the train workflow. In EvalHook, the
_do_evaluate function is called in after_train_iter. That's why you can successfully get evaluation results once and then hit the val workflow error.
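The ordering described in this reply can be sketched with a toy timeline. This is a simplified model written for illustration, not the mmcv runner: run_toy_workflow and its event strings are hypothetical, and the only behaviour modelled is that hook-based evaluation is attached to the train phase while the val phase needs ground truth.

```python
def run_toy_workflow(workflow, val_batches_have_gt):
    """Return the ordered events of one pass over a toy workflow.

    `workflow` uses the same shape as the config, e.g. [('train', 1), ('val', 1)].
    Hypothetical helper for illustration only; not part of mmcv or mmdet.
    """
    events = []
    for mode, epochs in workflow:
        for _ in range(epochs):
            if mode == "train":
                events.append("train epoch")
                # Hook-based evaluation runs as part of the train phase
                # and uses a test-style forward, so no ground truth is needed.
                events.append("EvalHook: metrics reported")
            elif mode == "val":
                if not val_batches_have_gt:
                    # The val phase computes losses, so missing GT stops the run.
                    events.append("val epoch: TypeError, GT arguments missing")
                    return events
                events.append("val epoch: losses computed")
    return events


with_val = run_toy_workflow([("train", 1), ("val", 1)], val_batches_have_gt=False)
train_only = run_toy_workflow([("train", 1)], val_batches_have_gt=False)

for event in with_val:
    print(event)
```

In this model the metrics are reported before the val phase starts, which matches the observation that evaluation results appear once before the crash. With workflow = [('train', 1)] the val phase never runs. In MMDetection 2.x configs, how often the hook evaluates is usually set through the interval key of the evaluation dict, for example evaluation = dict(interval=1, metric=['bbox', 'segm']); treat the exact values as an assumption to adapt to your own config.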