There are still bugs in the scripts used to run validation in train.py

Thanks for your error report and we appreciate it a lot.

Checklist

I have searched related issues but cannot get the expected help.
I have read the FAQ documentation but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug

Training scripts do not use the val pipeline when building the validation dataset, which raises the error: forward_train() missing 2 required positional arguments: 'gt_bboxes' and 'gt_labels'.

Reproduction

What command or script did you run?

from mmdet.datasets import build_dataset
from mmdet.models import build_detector
from mmdet.apis import train_detector
from mmcv import Config
import mmcv
import os.path as osp

cfg = Config.fromfile('<config>')

# Build dataset
datasets = [build_dataset(cfg.data.train), build_dataset(cfg.data.val)]

# Build the detector
model = build_detector(
    cfg.model,
    train_cfg=cfg.get('train_cfg'),
    test_cfg=cfg.get('test_cfg'))
# Add an attribute for visualization convenience
model.CLASSES = datasets[0].CLASSES

# Create work_dir
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
train_detector(model, datasets, cfg, distributed=False, validate=True)

Did you make any modifications on the code or config? Did you understand what you have modified?

Default mask rcnn settings on a custom dataset that I have used many times with this repo.

What dataset did you use?

nuimage

Environment

Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
You may add additional information that may be helpful for locating the problem, such as
- How you installed PyTorch [e.g., pip, conda, source]: conda
- Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

2022-03-03 01:51:24,996 - mmdet - INFO - Epoch(val) [1][16445] bbox_mAP: 0.0580, bbox_mAP_50: 0.1260, bbox_mAP_75: 0.0460, bbox_mAP_s: 0.0360, bbox_mAP_m: 0.0670, bbox_mAP_l: 0.0820, bbox_mAP_copypaste: 0.058 0.126 0.046 0.036 0.067 0.082, segm_mAP: 0.0490, segm_mAP_50: 0.1080, segm_mAP_75: 0.0390, segm_mAP_s: 0.0200, segm_mAP_m: 0.0590, segm_mAP_l: 0.0880, segm_mAP_copypaste: 0.049 0.108 0.039 0.020 0.059 0.088
Traceback (most recent call last):
  File "/workspace/pycharm_projects/mmdetection/train_debugging.py", line 23, in <module>
    train_detector(model, datasets, cfg, distributed=False, validate=True)
  File "/workspace/pycharm_projects/mmdetection/mmdet/apis/train.py", line 208, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 67, in val
    self.run_iter(data_batch, train_mode=False)
  File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 32, in run_iter
    outputs = self.model.val_step(data_batch, self.optimizer, **kwargs)
  File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 97, in val_step
    return self.module.val_step(*inputs[0], **kwargs[0])
  File "/workspace/pycharm_projects/mmdetection/mmdet/models/detectors/base.py", line 263, in val_step
    losses = self(**data)
  File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 109, in new_func
    return old_func(*args, **kwargs)
  File "/workspace/pycharm_projects/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
TypeError: forward_train() missing 2 required positional arguments: 'gt_bboxes' and 'gt_labels'

Process finished with exit code 1

Bug fix

The validation pipeline needs to be set properly in train.py. Currently, line 188 of train.py sets the val dataset pipeline as follows:

val_dataset.pipeline = cfg.data.train.pipeline

This was discussed in #5990, where the user there changed this line to use the val pipeline. However, @hhaAndroid said the following:

@jiangnanwuyanzu The above approach (changing line 188 to val_dataset.pipeline = cfg.data.val.pipeline) is correct because the val pipeline does not have GT. At the same time, we do not recommend using val workflow but evalhook.

It is not clear what is meant by this - the training script does not appear to do this automatically, which is why I believe this is a bug.

Interestingly, this error is thrown AFTER the script successfully produces my classwise validation results, so it is a very confusing issue. I do not want to change the scripts in your repo, especially since this seems to be intended behavior, but it seems like line 188 should be changed, since the evalhook is causing training to crash after 1 validation round.
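For readers following along, here is a minimal sketch of the behavior described above. It assumes plain dicts stand in for the mmcv Config, and the helper name build_dataset_cfgs is hypothetical (it is not part of mmdet). It only illustrates how the quoted line puts the train pipeline, which loads annotations, on the val dataset when the workflow has two stages:

```python
import copy


def build_dataset_cfgs(cfg):
    """Return one dataset config per workflow stage (hypothetical helper)."""
    dataset_cfgs = [cfg["data"]["train"]]
    if len(cfg["workflow"]) == 2:
        # Copy so the original val config stays untouched.
        val_cfg = copy.deepcopy(cfg["data"]["val"])
        # Mirrors the quoted line: val_dataset.pipeline = cfg.data.train.pipeline
        val_cfg["pipeline"] = cfg["data"]["train"]["pipeline"]
        dataset_cfgs.append(val_cfg)
    return dataset_cfgs


# Heavily trimmed, illustrative config; real pipelines contain more steps.
cfg = {
    "workflow": [("train", 1), ("val", 1)],
    "data": {
        "train": {"pipeline": [
            {"type": "LoadImageFromFile"},
            {"type": "LoadAnnotations", "with_bbox": True, "with_mask": True},
        ]},
        "val": {"pipeline": [{"type": "LoadImageFromFile"}]},
    },
}

train_cfg, val_cfg = build_dataset_cfgs(cfg)
# The val-stage dataset now carries the annotation-loading step.
print([step["type"] for step in val_cfg["pipeline"]])
```

In the reproduction script above, the second dataset is built directly from cfg.data.val, so this override never happens and the val stage receives batches without ground truth.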

Top GitHub Comments

pcicales commented, Mar 7, 2022 (2 reactions)

The following code:

from mmdet.datasets import build_dataset
from mmdet.models import build_detector
from mmdet.apis import train_detector
from mmcv import Config
import mmcv
import os.path as osp

cfg = Config.fromfile('config')

# Build dataset
datasets = [build_dataset(cfg.data.train)]

# Build the detector
model = build_detector(
    cfg.model,
    train_cfg=cfg.get('train_cfg'),
    test_cfg=cfg.get('test_cfg'))
# Add an attribute for visualization convenience
model.CLASSES = datasets[0].CLASSES

# Create work_dir
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
train_detector(model, datasets, cfg, distributed=False, validate=True)

yields the following error:

loading annotations into memory...
Done (t=6.43s)
creating index...
index created!
Traceback (most recent call last):
  File "/workspace/pycharm_projects/mmdetection/train_single_gpu.py", line 23, in <module>
    train_detector(model, datasets, cfg, distributed=False, validate=True)
  File "/workspace/pycharm_projects/mmdetection/mmdet/apis/train.py", line 208, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/anaconda3/envs/openmmlab_03012022/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 85, in run
    assert len(data_loaders) == len(workflow)
AssertionError

Process finished with exit code 1

The correct syntax is not intuitive. What is the issue with this code? It is all from the documentation… I still think this implementation is bugged, it needs to be updated to reflect the eval hooks. Perhaps you can clarify @jbwang1997 @RangiLyu @hhaAndroid

EDIT: the bug is caused by having the val workflow listed in the cfg workflow, i.e. workflow = [('train', 1), ('val', 1)] as opposed to workflow = [('train', 1)]. I don't understand how the validation workflow/pipeline works or is called, or how we control how often we validate on an epoch basis. Some guidance would be great; the relationship between the eval hooks and these training scripts is not very clear. Thank you for your replies, I'm sure I am just missing something…
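On the question of how often validation runs, the following is a minimal sketch rather than a definitive answer. It assumes an mmdet 2.x-style config in which the EvalHook options are read from the evaluation dict and interval counts epochs for the epoch-based runner. The check_workflow helper is hypothetical; it only mirrors the assertion shown in the traceback so that a mismatch is reported with a readable message:

```python
# Only run the train workflow; evaluation is then left to EvalHook.
workflow = [("train", 1)]

# Assumption: mmdet 2.x-style config. `interval=1` asks for evaluation
# after every training epoch; the metrics match the log shown earlier.
evaluation = dict(interval=1, metric=["bbox", "segm"])


def check_workflow(datasets, workflow):
    """Hypothetical pre-flight check mirroring the runner's assertion."""
    if len(datasets) != len(workflow):
        raise ValueError(
            f"got {len(datasets)} dataset(s) for {len(workflow)} workflow "
            "stage(s); they must match one-to-one")


# Passes: one dataset for one workflow stage.
check_workflow(["train_dataset"], workflow)
```

With a two-stage workflow, the same check would require a second dataset in the list, which is the situation the AssertionError above is complaining about.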

jbwang1997 commented, Mar 7, 2022 (1 reaction)

Thanks for your suggestion. We will continue to improve the related documents. For your question, the val workflow actually has the same logic as the train workflow. The difference between them is that they execute different hook functions. For example, the train workflow calls the before_train_iter and after_train_iter functions, while the val workflow calls the before_val_iter and after_val_iter functions. It is worth noting that return_loss is still True in the val workflow, so the model needs the same input as in the train workflow. That’s why we need to set val_dataset.pipeline = cfg.data.train.pipeline. You can refer to the details here.

The evaluation function is implemented in EvalHook and is called in the train workflow. As you can see in EvalHook, the _do_evaluate function is called in after_train_iter. That’s why you can successfully get evaluation results once and then hit the val workflow error.
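The difference between the two call paths can be illustrated with a toy, pure-Python sketch. None of this is mmdet code: the three functions are simplified stand-ins, and it is an assumption of the sketch that evaluation calls the model with return_loss=False while the val workflow keeps the default return_loss=True, as described above:

```python
def forward_train(img, gt_bboxes, gt_labels):
    """Stand-in for the loss computation; needs ground truth."""
    return {"loss": 0.0}


def forward_test(img):
    """Stand-in for inference; needs only the image."""
    return ["detections"]


def forward(return_loss=True, **data):
    # Same dispatch idea as the detector's forward in the traceback.
    if return_loss:
        return forward_train(**data)
    return forward_test(**data)


# A batch from a test-style pipeline carries no ground truth.
batch = {"img": "fake-image"}

# Evaluation-style call: return_loss=False, so only the image is needed.
print(forward(return_loss=False, **batch))

# val-workflow-style call: default return_loss=True, so ground truth is
# required and the call fails in the same way as the traceback.
try:
    forward(**batch)
except TypeError as err:
    print(err)
```

This is why the evaluation results appear first and the crash follows: the two paths consume the same kind of batch but make different demands on it.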
