DetectoRS loss is always NaN
Checklist
- I have searched related issues but cannot get the expected help.
- I have read the FAQ documentation but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug
The loss is always NaN when training with any config from the detectors folder on my own COCO-style dataset(s). Every other model in this repo works fine on the same dataset (object detection as well as instance segmentation). I've tried every object-detection config in the detectors config folder and double-checked that the dataset contains no zero-size boxes (a sketch of that check follows below). I've pasted the config file at the end of the issue in case it helps. I also looked through the existing issues; issue 6740 seemed to describe the same problem, but it was closed.
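The zero-size-box check was along these lines (a minimal sketch assuming plain COCO JSON; the annotation path is a placeholder for the file named in the config below):

import json

# Load the COCO annotation file and flag degenerate boxes.
with open('annotations/instances_val2017.json') as f:
    coco = json.load(f)

for ann in coco['annotations']:
    x, y, w, h = ann['bbox']  # COCO format: [x, y, width, height]
    if w <= 0 or h <= 0:
        print(f"zero-size box in annotation {ann['id']} "
              f"(image {ann['image_id']}): {ann['bbox']}")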
Reproduction
- What command or script did you run?
from mmdet.models import build_detector
from mmdet.datasets import build_dataset
from mmdet.apis import train_detector

self.model = build_detector(self.cfg.model,
                            train_cfg=self.cfg.get('train_cfg'),
                            test_cfg=self.cfg.get('test_cfg'))
datasets = [build_dataset(self.cfg.data.train)]
train_detector(self.model, datasets[0], self.cfg, distributed=False, validate=True)
where self.cfg is the config at the end of this issue.
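For completeness, a sketch of how such a cfg object can be constructed with the standard mmcv Config API (the config filename is illustrative, not necessarily the exact one used):

from mmcv import Config

# Any config in configs/detectors reproduces the problem.
cfg = Config.fromfile('configs/detectors/detectors_cascade_rcnn_r50_1x_coco.py')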
- Did you make any modifications on the code or config? Did you understand what you have modified?
The only things I've changed are dataset-related settings (data_root, classes, …), samples_per_gpu, and num_classes (sketched below). But since every other model works just fine (Faster R-CNN, Mask R-CNN, YOLOX, …), I don't really know where the problem is.
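Because this model is Cascade R-CNN based, num_classes has to be changed on each of the three stage heads. A sketch of the overrides, assuming the cfg object from the snippet above:

# Dataset-related overrides. A CascadeRoIHead has one bbox_head per stage,
# so num_classes must be set on all three entries.
for bbox_head in cfg.model.roi_head.bbox_head:
    bbox_head.num_classes = 4
for split in ('train', 'val', 'test'):
    cfg.data[split].classes = ('0', '1', '2', '3')
cfg.data.samples_per_gpu = 4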
- What dataset did you use?
My custom dataset.
Environment
- Please run
python mmdet/utils/collect_env.py
sys.platform: win32
Python: 3.8.10 (default, May 19 2021, 13:12:57) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: None
GCC: n/a
PyTorch: 1.9.0
PyTorch compiling details: PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel® Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel® 64 architecture applications
  - Intel® MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 2019
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.0.5
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=C:/cb/pytorch_1000000000000/work/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/cb/pytorch_1000000000000/work/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON
TorchVision: 0.10.0
OpenCV: 4.5.3
MMCV: 1.4.3
MMCV Compiler: MSVC 192930137
MMCV CUDA Compiler: 11.1
MMDetection: 2.20.0+ff9bc39
Error traceback
The first few steps of the training log (the rest looks about the same):
2022-02-01 17:54:32,698 - mmdet - INFO - Epoch [1][1/93] lr: 2.000e-05, eta: 0:06:19, time: 4.122, data_time: 2.085, memory: 7863, loss_rpn_cls: nan, loss_rpn_bbox: nan, s0.loss_cls: nan, s0.acc: 0.0000, s0.loss_bbox: nan, s1.loss_cls: nan, s1.acc: 0.0000, s1.loss_bbox: nan, s2.loss_cls: nan, s2.acc: 0.0000, s2.loss_bbox: nan, loss: nan
2022-02-01 17:54:33,278 - mmdet - INFO - Epoch [1][2/93] lr: 5.996e-05, eta: 0:03:33, time: 0.579, data_time: 0.077, memory: 8430, loss_rpn_cls: nan, loss_rpn_bbox: nan, s0.loss_cls: nan, s0.acc: 0.0000, s0.loss_bbox: nan, s1.loss_cls: nan, s1.acc: 0.0000, s1.loss_bbox: nan, s2.loss_cls: nan, s2.acc: 0.0000, s2.loss_bbox: nan, loss: nan
2022-02-01 17:54:33,834 - mmdet - INFO - Epoch [1][3/93] lr: 9.992e-05, eta: 0:02:37, time: 0.556, data_time: 0.077, memory: 8430, loss_rpn_cls: nan, loss_rpn_bbox: nan, s0.loss_cls: nan, s0.acc: 1.4205, s0.loss_bbox: nan, s1.loss_cls: nan, s1.acc: 1.4205, s1.loss_bbox: nan, s2.loss_cls: nan, s2.acc: 1.4205, s2.loss_bbox: nan, loss: nan
2022-02-01 17:54:34,410 - mmdet - INFO - Epoch [1][4/93] lr: 1.399e-04, eta: 0:02:09, time: 0.576, data_time: 0.077, memory: 8430, loss_rpn_cls: nan, loss_rpn_bbox: nan, s0.loss_cls: nan, s0.acc: 0.0000, s0.loss_bbox: nan, s1.loss_cls: nan, s1.acc: 0.0000, s1.loss_bbox: nan, s2.loss_cls: nan, s2.acc: 0.0000, s2.loss_bbox: nan, loss: nan
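A generic way to localize where such NaNs first appear is PyTorch's built-in anomaly detection (nothing DetectoRS-specific; a sketch one could enable before training starts, and only for a few debugging iterations, since it slows training down considerably):

import torch

# With anomaly detection enabled, backward() raises an error identifying the
# first operation whose gradient contains NaN/Inf instead of training on.
torch.autograd.set_detect_anomaly(True)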
Just one config as an example, but this happens for every config in the detectors folder.
{'model': {'type': 'CascadeRCNN',
  'backbone': {'type': 'DetectoRS_ResNet', 'depth': 50, 'num_stages': 4, 'out_indices': (0, 1, 2, 3), 'frozen_stages': 1, 'norm_cfg': {'type': 'BN', 'requires_grad': True}, 'norm_eval': True, 'style': 'pytorch', 'init_cfg': {'type': 'Pretrained', 'checkpoint': 'torchvision://resnet50'}, 'conv_cfg': {'type': 'ConvAWS'}, 'output_img': True},
  'neck': {'type': 'RFP', 'in_channels': [256, 512, 1024, 2048], 'out_channels': 256, 'num_outs': 5, 'rfp_steps': 2, 'aspp_out_channels': 64, 'aspp_dilations': (1, 3, 6, 1), 'rfp_backbone': {'rfp_inplanes': 256, 'type': 'DetectoRS_ResNet', 'depth': 50, 'num_stages': 4, 'out_indices': (0, 1, 2, 3), 'frozen_stages': 1, 'norm_cfg': {'type': 'BN', 'requires_grad': True}, 'norm_eval': True, 'conv_cfg': {'type': 'ConvAWS'}, 'pretrained': 'torchvision://resnet50', 'style': 'pytorch'}},
  'rpn_head': {'type': 'RPNHead', 'in_channels': 256, 'feat_channels': 256, 'anchor_generator': {'type': 'AnchorGenerator', 'scales': [8], 'ratios': [0.5, 1.0, 2.0], 'strides': [4, 8, 16, 32, 64]}, 'bbox_coder': {'type': 'DeltaXYWHBBoxCoder', 'target_means': [0.0, 0.0, 0.0, 0.0], 'target_stds': [1.0, 1.0, 1.0, 1.0]}, 'loss_cls': {'type': 'CrossEntropyLoss', 'use_sigmoid': True, 'loss_weight': 1.0}, 'loss_bbox': {'type': 'SmoothL1Loss', 'beta': 0.1111111111111111, 'loss_weight': 1.0}},
  'roi_head': {'type': 'CascadeRoIHead', 'num_stages': 3, 'stage_loss_weights': [1, 0.5, 0.25],
    'bbox_roi_extractor': {'type': 'SingleRoIExtractor', 'roi_layer': {'type': 'RoIAlign', 'output_size': 7, 'sampling_ratio': 0}, 'out_channels': 256, 'featmap_strides': [4, 8, 16, 32]},
    'bbox_head': [
      {'type': 'Shared2FCBBoxHead', 'in_channels': 256, 'fc_out_channels': 1024, 'roi_feat_size': 7, 'num_classes': 4, 'bbox_coder': {'type': 'DeltaXYWHBBoxCoder', 'target_means': [0.0, 0.0, 0.0, 0.0], 'target_stds': [0.1, 0.1, 0.2, 0.2]}, 'reg_class_agnostic': True, 'loss_cls': {'type': 'CrossEntropyLoss', 'use_sigmoid': False, 'loss_weight': 1.0}, 'loss_bbox': {'type': 'SmoothL1Loss', 'beta': 1.0, 'loss_weight': 1.0}},
      {'type': 'Shared2FCBBoxHead', 'in_channels': 256, 'fc_out_channels': 1024, 'roi_feat_size': 7, 'num_classes': 4, 'bbox_coder': {'type': 'DeltaXYWHBBoxCoder', 'target_means': [0.0, 0.0, 0.0, 0.0], 'target_stds': [0.05, 0.05, 0.1, 0.1]}, 'reg_class_agnostic': True, 'loss_cls': {'type': 'CrossEntropyLoss', 'use_sigmoid': False, 'loss_weight': 1.0}, 'loss_bbox': {'type': 'SmoothL1Loss', 'beta': 1.0, 'loss_weight': 1.0}},
      {'type': 'Shared2FCBBoxHead', 'in_channels': 256, 'fc_out_channels': 1024, 'roi_feat_size': 7, 'num_classes': 4, 'bbox_coder': {'type': 'DeltaXYWHBBoxCoder', 'target_means': [0.0, 0.0, 0.0, 0.0], 'target_stds': [0.033, 0.033, 0.067, 0.067]}, 'reg_class_agnostic': True, 'loss_cls': {'type': 'CrossEntropyLoss', 'use_sigmoid': False, 'loss_weight': 1.0}, 'loss_bbox': {'type': 'SmoothL1Loss', 'beta': 1.0, 'loss_weight': 1.0}}]},
  'train_cfg': {
    'rpn': {'assigner': {'type': 'MaxIoUAssigner', 'pos_iou_thr': 0.7, 'neg_iou_thr': 0.3, 'min_pos_iou': 0.3, 'match_low_quality': True, 'ignore_iof_thr': -1}, 'sampler': {'type': 'RandomSampler', 'num': 256, 'pos_fraction': 0.5, 'neg_pos_ub': -1, 'add_gt_as_proposals': False}, 'allowed_border': 0, 'pos_weight': -1, 'debug': False},
    'rpn_proposal': {'nms_pre': 2000, 'max_per_img': 2000, 'nms': {'type': 'nms', 'iou_threshold': 0.7}, 'min_bbox_size': 0},
    'rcnn': [
      {'assigner': {'type': 'MaxIoUAssigner', 'pos_iou_thr': 0.5, 'neg_iou_thr': 0.5, 'min_pos_iou': 0.5, 'match_low_quality': False, 'ignore_iof_thr': -1}, 'sampler': {'type': 'RandomSampler', 'num': 512, 'pos_fraction': 0.25, 'neg_pos_ub': -1, 'add_gt_as_proposals': True}, 'pos_weight': -1, 'debug': False},
      {'assigner': {'type': 'MaxIoUAssigner', 'pos_iou_thr': 0.6, 'neg_iou_thr': 0.6, 'min_pos_iou': 0.6, 'match_low_quality': False, 'ignore_iof_thr': -1}, 'sampler': {'type': 'RandomSampler', 'num': 512, 'pos_fraction': 0.25, 'neg_pos_ub': -1, 'add_gt_as_proposals': True}, 'pos_weight': -1, 'debug': False},
      {'assigner': {'type': 'MaxIoUAssigner', 'pos_iou_thr': 0.7, 'neg_iou_thr': 0.7, 'min_pos_iou': 0.7, 'match_low_quality': False, 'ignore_iof_thr': -1}, 'sampler': {'type': 'RandomSampler', 'num': 512, 'pos_fraction': 0.25, 'neg_pos_ub': -1, 'add_gt_as_proposals': True}, 'pos_weight': -1, 'debug': False}]},
  'test_cfg': {'rpn': {'nms_pre': 1000, 'max_per_img': 1000, 'nms': {'type': 'nms', 'iou_threshold': 0.7}, 'min_bbox_size': 0}, 'rcnn': {'score_thr': 0.05, 'nms': {'type': 'nms', 'iou_threshold': 0.5}, 'max_per_img': 100}}},
 'dataset_type': 'CocoDataset',
 'data_root': 'F:\source\repos\YOLOX\datasets\testimages',
 'img_norm_cfg': {'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375], 'to_rgb': True},
 'train_pipeline': [{'type': 'LoadImageFromFile'}, {'type': 'LoadAnnotations', 'with_bbox': True}, {'type': 'Resize', 'img_scale': (1333, 800), 'keep_ratio': True}, {'type': 'RandomFlip', 'flip_ratio': 0.5}, {'type': 'Normalize', 'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375], 'to_rgb': True}, {'type': 'Pad', 'size_divisor': 32}, {'type': 'DefaultFormatBundle'}, {'type': 'Collect', 'keys': ['img', 'gt_bboxes', 'gt_labels']}],
 'test_pipeline': [{'type': 'LoadImageFromFile'}, {'type': 'MultiScaleFlipAug', 'img_scale': (1333, 800), 'flip': False, 'transforms': [{'type': 'Resize', 'keep_ratio': True}, {'type': 'RandomFlip'}, {'type': 'Normalize', 'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375], 'to_rgb': True}, {'type': 'Pad', 'size_divisor': 32}, {'type': 'ImageToTensor', 'keys': ['img']}, {'type': 'Collect', 'keys': ['img']}]}],
 'data': {'samples_per_gpu': 4, 'workers_per_gpu': 0,
   'train': {'type': 'CocoDataset', 'ann_file': 'F:\source\repos\YOLOX\datasets\testimages\annotations/instances_val2017.json', 'img_prefix': 'F:\source\repos\YOLOX\datasets\testimages\val2017/', 'classes': ('0', '1', '2', '3'), 'pipeline': [{'type': 'LoadImageFromFile'}, {'type': 'LoadAnnotations', 'with_bbox': True}, {'type': 'Resize', 'img_scale': (1333, 800), 'keep_ratio': True}, {'type': 'RandomFlip', 'flip_ratio': 0.5}, {'type': 'Normalize', 'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375], 'to_rgb': True}, {'type': 'Pad', 'size_divisor': 32}, {'type': 'DefaultFormatBundle'}, {'type': 'Collect', 'keys': ['img', 'gt_bboxes', 'gt_labels']}]},
   'val': {'type': 'CocoDataset', 'ann_file': 'F:\source\repos\YOLOX\datasets\testimages\annotations/instances_val2017.json', 'img_prefix': 'F:\source\repos\YOLOX\datasets\testimages\val2017/', 'classes': ('0', '1', '2', '3'), 'pipeline': [{'type': 'LoadImageFromFile'}, {'type': 'MultiScaleFlipAug', 'img_scale': (1333, 800), 'flip': False, 'transforms': [{'type': 'Resize', 'keep_ratio': True}, {'type': 'RandomFlip'}, {'type': 'Normalize', 'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375], 'to_rgb': True}, {'type': 'Pad', 'size_divisor': 32}, {'type': 'ImageToTensor', 'keys': ['img']}, {'type': 'Collect', 'keys': ['img']}]}]},
   'test': {'type': 'CocoDataset', 'ann_file': 'F:\source\repos\YOLOX\datasets\testimages\annotations/instances_val2017.json', 'img_prefix': 'F:\source\repos\YOLOX\datasets\testimages\val2017/', 'classes': ('0', '1', '2', '3'), 'pipeline': [{'type': 'LoadImageFromFile'}, {'type': 'MultiScaleFlipAug', 'img_scale': (1333, 800), 'flip': False, 'transforms': [{'type': 'Resize', 'keep_ratio': True}, {'type': 'RandomFlip'}, {'type': 'Normalize', 'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375], 'to_rgb': True}, {'type': 'Pad', 'size_divisor': 32}, {'type': 'ImageToTensor', 'keys': ['img']}, {'type': 'Collect', 'keys': ['img']}]}]}},
 'evaluation': {'interval': 10, 'metric': 'bbox', 'save_best': 'bbox_mAP'},
 'optimizer': {'type': 'SGD', 'lr': 0.02, 'momentum': 0.9, 'weight_decay': 0.0001},
 'optimizer_config': {'grad_clip': None},
 'lr_config': {'policy': 'step', 'warmup': 'linear', 'warmup_iters': 500, 'warmup_ratio': 0.001, 'step': [8, 11]},
 'runner': {'type': 'EpochBasedRunner', 'max_epochs': 100},
 'checkpoint_config': {'interval': 1},
 'log_config': {'interval': 1, 'hooks': [{'type': 'TextLoggerHook'}]},
 'custom_hooks': [{'type': 'NumClassCheckHook'}],
 'dist_params': {'backend': 'nccl'},
 'log_level': 'INFO',
 'load_from': None,
 'resume_from': None,
 'workflow': [('train', 1)],
 'seed': 1234,
 'gpu_ids': [0],
 'work_dir': '',
 'total_epochs': 100}
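Worth noting in the dump above: 'grad_clip' is None. The MMDetection FAQ suggests enabling gradient clipping as a first measure against NaN losses; in config-file form that change would look roughly like this (the max_norm/norm_type values are the FAQ's example, not a verified fix for this issue):

optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))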
Comments
@MaxVanDijck from 0.12 to 0.00012
I've added this line to the config, but unfortunately the result is still the same. I've also tried training on the mini-coco128 dataset downloaded from the YOLOX documentation, and that works fine without any NaNs. So this really does seem to be dataset-dependent, even though I don't know yet what causes it. But since it apparently has nothing to do with the DetectoRS implementation, I'll close this issue for now. If I find out why this happens, I'll comment again.