
DetectoRS loss is always NaN

See original GitHub issue

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug
The loss is always NaN with any config in detectors when training on my own COCO-style dataset(s). Every other model in this repo works fine on the same dataset (object detection as well as instance segmentation). I have tried every object-detection config in the detectors config folder and double-checked that the dataset contains no 0-size boxes. I have pasted the config file at the end of the issue in case it helps. I also looked through all the issues; issue 6740 seemed to describe the same problem but was closed.

Reproduction

  1. What command or script did you run?
from mmdet.models import build_detector
from mmdet.datasets import build_dataset
from mmdet.apis import train_detector

# build the DetectoRS model and the training set, then start training
self.model = build_detector(self.cfg.model,
                            train_cfg=self.cfg.get('train_cfg'),
                            test_cfg=self.cfg.get('test_cfg'))
datasets = [build_dataset(self.cfg.data.train)]
train_detector(self.model, datasets[0], self.cfg, distributed=False, validate=True)

where self.cfg is the config at the end of this issue.

  2. Did you make any modifications to the code or config? Did you understand what you have modified?

    The only things I’ve changed are dataset-related (data_root, classes, …), samples_per_gpu and num_classes; a sketch of these overrides is shown after this list. But since every other model works just fine (Faster R-CNN, Mask R-CNN, YOLOX, …), I don’t really know where the problem is.

  3. What dataset did you use? My custom dataset.
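For reference, the overrides mentioned in point 2 amount to roughly the following sketch; the config path, data_root and class names are placeholders rather than values taken from this issue:

from mmcv import Config

# placeholder config path; any config under configs/detectors/ reproduces the issue
cfg = Config.fromfile('configs/detectors/detectors_cascade_rcnn_r50_1x_coco.py')

# dataset-related overrides (ann_file / img_prefix would be set the same way)
cfg.data_root = 'path/to/my/coco_style_dataset'
cfg.data.samples_per_gpu = 4
for split in ('train', 'val', 'test'):
    cfg.data[split].classes = ('0', '1', '2', '3')

# DetectoRS wraps Cascade R-CNN, so num_classes lives in each of the three bbox heads
for head in cfg.model.roi_head.bbox_head:
    head.num_classes = 4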

Environment

  1. Please run python mmdet/utils/collect_env.py

sys.platform: win32
Python: 3.8.10 (default, May 19 2021, 13:12:57) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: None
GCC: n/a
PyTorch: 1.9.0
PyTorch compiling details: PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel® Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel® 64 architecture applications
  - Intel® MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 2019
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.0.5
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=C:/cb/pytorch_1000000000000/work/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/cb/pytorch_1000000000000/work/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON,
TorchVision: 0.10.0
OpenCV: 4.5.3
MMCV: 1.4.3
MMCV Compiler: MSVC 192930137
MMCV CUDA Compiler: 11.1
MMDetection: 2.20.0+ff9bc39

Error traceback
The first few steps of the training log (the rest looks about the same):

2022-02-01 17:54:32,698 - mmdet - INFO - Epoch [1][1/93] lr: 2.000e-05, eta: 0:06:19, time: 4.122, data_time: 2.085, memory: 7863, loss_rpn_cls: nan, loss_rpn_bbox: nan, s0.loss_cls: nan, s0.acc: 0.0000, s0.loss_bbox: nan, s1.loss_cls: nan, s1.acc: 0.0000, s1.loss_bbox: nan, s2.loss_cls: nan, s2.acc: 0.0000, s2.loss_bbox: nan, loss: nan
2022-02-01 17:54:33,278 - mmdet - INFO - Epoch [1][2/93] lr: 5.996e-05, eta: 0:03:33, time: 0.579, data_time: 0.077, memory: 8430, loss_rpn_cls: nan, loss_rpn_bbox: nan, s0.loss_cls: nan, s0.acc: 0.0000, s0.loss_bbox: nan, s1.loss_cls: nan, s1.acc: 0.0000, s1.loss_bbox: nan, s2.loss_cls: nan, s2.acc: 0.0000, s2.loss_bbox: nan, loss: nan
2022-02-01 17:54:33,834 - mmdet - INFO - Epoch [1][3/93] lr: 9.992e-05, eta: 0:02:37, time: 0.556, data_time: 0.077, memory: 8430, loss_rpn_cls: nan, loss_rpn_bbox: nan, s0.loss_cls: nan, s0.acc: 1.4205, s0.loss_bbox: nan, s1.loss_cls: nan, s1.acc: 1.4205, s1.loss_bbox: nan, s2.loss_cls: nan, s2.acc: 1.4205, s2.loss_bbox: nan, loss: nan
2022-02-01 17:54:34,410 - mmdet - INFO - Epoch [1][4/93] lr: 1.399e-04, eta: 0:02:09, time: 0.576, data_time: 0.077, memory: 8430, loss_rpn_cls: nan, loss_rpn_bbox: nan, s0.loss_cls: nan, s0.acc: 0.0000, s0.loss_bbox: nan, s1.loss_cls: nan, s1.acc: 0.0000, s1.loss_bbox: nan, s2.loss_cls: nan, s2.acc: 0.0000, s2.loss_bbox: nan, loss: nan
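Every loss term is already NaN at the very first iteration, which points at the forward pass itself producing non-finite values. A generic way to locate the first offending layer is to hook every module before running a single training step; this is plain PyTorch rather than an mmdet-specific API, and model is assumed to be the detector built in the snippet above:

import torch

def install_nan_hooks(model):
    # raise as soon as any module emits a non-finite output
    def check(module, inputs, output):
        outputs = output.values() if isinstance(output, dict) else \
            output if isinstance(output, (list, tuple)) else [output]
        for t in outputs:
            if torch.is_tensor(t) and not torch.isfinite(t).all():
                raise RuntimeError(f'non-finite output from {module.__class__.__name__}')
    for m in model.modules():
        m.register_forward_hook(check)

install_nan_hooks(model)  # then run one training iteration as usual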

Below is just one config as an example, but this happens with every config in the detectors folder.

{'model': {'type': 'CascadeRCNN',
'backbone': {'type': 'DetectoRS_ResNet', 'depth': 50, 'num_stages': 4, 'out_indices': (0, 1, 2, 3), 'frozen_stages': 1, 'norm_cfg': {'type': 'BN', 'requires_grad': True}, 'norm_eval': True, 'style': 'pytorch', 'init_cfg': {'type': 'Pretrained', 'checkpoint': 'torchvision://resnet50'}, 'conv_cfg': {'type': 'ConvAWS'}, 'output_img': True},
'neck': {'type': 'RFP', 'in_channels': [256, 512, 1024, 2048], 'out_channels': 256, 'num_outs': 5, 'rfp_steps': 2, 'aspp_out_channels': 64, 'aspp_dilations': (1, 3, 6, 1), 'rfp_backbone': {'rfp_inplanes': 256, 'type': 'DetectoRS_ResNet', 'depth': 50, 'num_stages': 4, 'out_indices': (0, 1, 2, 3), 'frozen_stages': 1, 'norm_cfg': {'type': 'BN', 'requires_grad': True}, 'norm_eval': True, 'conv_cfg': {'type': 'ConvAWS'}, 'pretrained': 'torchvision://resnet50', 'style': 'pytorch'}},
'rpn_head': {'type': 'RPNHead', 'in_channels': 256, 'feat_channels': 256, 'anchor_generator': {'type': 'AnchorGenerator', 'scales': [8], 'ratios': [0.5, 1.0, 2.0], 'strides': [4, 8, 16, 32, 64]}, 'bbox_coder': {'type': 'DeltaXYWHBBoxCoder', 'target_means': [0.0, 0.0, 0.0, 0.0], 'target_stds': [1.0, 1.0, 1.0, 1.0]}, 'loss_cls': {'type': 'CrossEntropyLoss', 'use_sigmoid': True, 'loss_weight': 1.0}, 'loss_bbox': {'type': 'SmoothL1Loss', 'beta': 0.1111111111111111, 'loss_weight': 1.0}},
'roi_head': {'type': 'CascadeRoIHead', 'num_stages': 3, 'stage_loss_weights': [1, 0.5, 0.25], 'bbox_roi_extractor': {'type': 'SingleRoIExtractor', 'roi_layer': {'type': 'RoIAlign', 'output_size': 7, 'sampling_ratio': 0}, 'out_channels': 256, 'featmap_strides': [4, 8, 16, 32]},
'bbox_head': [{'type': 'Shared2FCBBoxHead', 'in_channels': 256, 'fc_out_channels': 1024, 'roi_feat_size': 7, 'num_classes': 4, 'bbox_coder': {'type': 'DeltaXYWHBBoxCoder', 'target_means': [0.0, 0.0, 0.0, 0.0], 'target_stds': [0.1, 0.1, 0.2, 0.2]}, 'reg_class_agnostic': True, 'loss_cls': {'type': 'CrossEntropyLoss', 'use_sigmoid': False, 'loss_weight': 1.0}, 'loss_bbox': {'type': 'SmoothL1Loss', 'beta': 1.0, 'loss_weight': 1.0}},
{'type': 'Shared2FCBBoxHead', 'in_channels': 256, 'fc_out_channels': 1024, 'roi_feat_size': 7, 'num_classes': 4, 'bbox_coder': {'type': 'DeltaXYWHBBoxCoder', 'target_means': [0.0, 0.0, 0.0, 0.0], 'target_stds': [0.05, 0.05, 0.1, 0.1]}, 'reg_class_agnostic': True, 'loss_cls': {'type': 'CrossEntropyLoss', 'use_sigmoid': False, 'loss_weight': 1.0}, 'loss_bbox': {'type': 'SmoothL1Loss', 'beta': 1.0, 'loss_weight': 1.0}},
{'type': 'Shared2FCBBoxHead', 'in_channels': 256, 'fc_out_channels': 1024, 'roi_feat_size': 7, 'num_classes': 4, 'bbox_coder': {'type': 'DeltaXYWHBBoxCoder', 'target_means': [0.0, 0.0, 0.0, 0.0], 'target_stds': [0.033, 0.033, 0.067, 0.067]}, 'reg_class_agnostic': True, 'loss_cls': {'type': 'CrossEntropyLoss', 'use_sigmoid': False, 'loss_weight': 1.0}, 'loss_bbox': {'type': 'SmoothL1Loss', 'beta': 1.0, 'loss_weight': 1.0}}]},
'train_cfg': {'rpn': {'assigner': {'type': 'MaxIoUAssigner', 'pos_iou_thr': 0.7, 'neg_iou_thr': 0.3, 'min_pos_iou': 0.3, 'match_low_quality': True, 'ignore_iof_thr': -1}, 'sampler': {'type': 'RandomSampler', 'num': 256, 'pos_fraction': 0.5, 'neg_pos_ub': -1, 'add_gt_as_proposals': False}, 'allowed_border': 0, 'pos_weight': -1, 'debug': False},
'rpn_proposal': {'nms_pre': 2000, 'max_per_img': 2000, 'nms': {'type': 'nms', 'iou_threshold': 0.7}, 'min_bbox_size': 0},
'rcnn': [{'assigner': {'type': 'MaxIoUAssigner', 'pos_iou_thr': 0.5, 'neg_iou_thr': 0.5, 'min_pos_iou': 0.5, 'match_low_quality': False, 'ignore_iof_thr': -1}, 'sampler': {'type': 'RandomSampler', 'num': 512, 'pos_fraction': 0.25, 'neg_pos_ub': -1, 'add_gt_as_proposals': True}, 'pos_weight': -1, 'debug': False},
{'assigner': {'type': 'MaxIoUAssigner', 'pos_iou_thr': 0.6, 'neg_iou_thr': 0.6, 'min_pos_iou': 0.6, 'match_low_quality': False, 'ignore_iof_thr': -1}, 'sampler': {'type': 'RandomSampler', 'num': 512, 'pos_fraction': 0.25, 'neg_pos_ub': -1, 'add_gt_as_proposals': True}, 'pos_weight': -1, 'debug': False},
{'assigner': {'type': 'MaxIoUAssigner', 'pos_iou_thr': 0.7, 'neg_iou_thr': 0.7, 'min_pos_iou': 0.7, 'match_low_quality': False, 'ignore_iof_thr': -1}, 'sampler': {'type': 'RandomSampler', 'num': 512, 'pos_fraction': 0.25, 'neg_pos_ub': -1, 'add_gt_as_proposals': True}, 'pos_weight': -1, 'debug': False}]},
'test_cfg': {'rpn': {'nms_pre': 1000, 'max_per_img': 1000, 'nms': {'type': 'nms', 'iou_threshold': 0.7}, 'min_bbox_size': 0}, 'rcnn': {'score_thr': 0.05, 'nms': {'type': 'nms', 'iou_threshold': 0.5}, 'max_per_img': 100}}},
'dataset_type': 'CocoDataset', 'data_root': 'F:\source\repos\YOLOX\datasets\testimages',
'img_norm_cfg': {'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375], 'to_rgb': True},
'train_pipeline': [{'type': 'LoadImageFromFile'}, {'type': 'LoadAnnotations', 'with_bbox': True}, {'type': 'Resize', 'img_scale': (1333, 800), 'keep_ratio': True}, {'type': 'RandomFlip', 'flip_ratio': 0.5}, {'type': 'Normalize', 'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375], 'to_rgb': True}, {'type': 'Pad', 'size_divisor': 32}, {'type': 'DefaultFormatBundle'}, {'type': 'Collect', 'keys': ['img', 'gt_bboxes', 'gt_labels']}],
'test_pipeline': [{'type': 'LoadImageFromFile'}, {'type': 'MultiScaleFlipAug', 'img_scale': (1333, 800), 'flip': False, 'transforms': [{'type': 'Resize', 'keep_ratio': True}, {'type': 'RandomFlip'}, {'type': 'Normalize', 'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375], 'to_rgb': True}, {'type': 'Pad', 'size_divisor': 32}, {'type': 'ImageToTensor', 'keys': ['img']}, {'type': 'Collect', 'keys': ['img']}]}],
'data': {'samples_per_gpu': 4, 'workers_per_gpu': 0,
'train': {'type': 'CocoDataset', 'ann_file': 'F:\source\repos\YOLOX\datasets\testimages\annotations/instances_val2017.json', 'img_prefix': 'F:\source\repos\YOLOX\datasets\testimages\val2017/', 'classes': ('0', '1', '2', '3'), 'pipeline': [{'type': 'LoadImageFromFile'}, {'type': 'LoadAnnotations', 'with_bbox': True}, {'type': 'Resize', 'img_scale': (1333, 800), 'keep_ratio': True}, {'type': 'RandomFlip', 'flip_ratio': 0.5}, {'type': 'Normalize', 'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375], 'to_rgb': True}, {'type': 'Pad', 'size_divisor': 32}, {'type': 'DefaultFormatBundle'}, {'type': 'Collect', 'keys': ['img', 'gt_bboxes', 'gt_labels']}]},
'val': {'type': 'CocoDataset', 'ann_file': 'F:\source\repos\YOLOX\datasets\testimages\annotations/instances_val2017.json', 'img_prefix': 'F:\source\repos\YOLOX\datasets\testimages\val2017/', 'classes': ('0', '1', '2', '3'), 'pipeline': [{'type': 'LoadImageFromFile'}, {'type': 'MultiScaleFlipAug', 'img_scale': (1333, 800), 'flip': False, 'transforms': [{'type': 'Resize', 'keep_ratio': True}, {'type': 'RandomFlip'}, {'type': 'Normalize', 'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375], 'to_rgb': True}, {'type': 'Pad', 'size_divisor': 32}, {'type': 'ImageToTensor', 'keys': ['img']}, {'type': 'Collect', 'keys': ['img']}]}]},
'test': {'type': 'CocoDataset', 'ann_file': 'F:\source\repos\YOLOX\datasets\testimages\annotations/instances_val2017.json', 'img_prefix': 'F:\source\repos\YOLOX\datasets\testimages\val2017/', 'classes': ('0', '1', '2', '3'), 'pipeline': [{'type': 'LoadImageFromFile'}, {'type': 'MultiScaleFlipAug', 'img_scale': (1333, 800), 'flip': False, 'transforms': [{'type': 'Resize', 'keep_ratio': True}, {'type': 'RandomFlip'}, {'type': 'Normalize', 'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375], 'to_rgb': True}, {'type': 'Pad', 'size_divisor': 32}, {'type': 'ImageToTensor', 'keys': ['img']}, {'type': 'Collect', 'keys': ['img']}]}]}},
'evaluation': {'interval': 10, 'metric': 'bbox', 'save_best': 'bbox_mAP'},
'optimizer': {'type': 'SGD', 'lr': 0.02, 'momentum': 0.9, 'weight_decay': 0.0001}, 'optimizer_config': {'grad_clip': None},
'lr_config': {'policy': 'step', 'warmup': 'linear', 'warmup_iters': 500, 'warmup_ratio': 0.001, 'step': [8, 11]},
'runner': {'type': 'EpochBasedRunner', 'max_epochs': 100}, 'checkpoint_config': {'interval': 1},
'log_config': {'interval': 1, 'hooks': [{'type': 'TextLoggerHook'}]}, 'custom_hooks': [{'type': 'NumClassCheckHook'}],
'dist_params': {'backend': 'nccl'}, 'log_level': 'INFO', 'load_from': None, 'resume_from': None, 'workflow': [('train', 1)], 'seed': 1234, 'gpu_ids': [0], 'work_dir': '', 'total_epochs': 100}
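For what it's worth, the usual MMDetection FAQ suggestions for a NaN loss map onto the config above roughly as follows; the concrete numbers are illustrative, not a verified fix for this particular dataset:

# assumes `cfg` is the config object dumped above
cfg.optimizer.lr = 0.02 * (1 * 4) / (8 * 2)                      # linear scaling: 1 GPU x 4 imgs vs. the 8x2 default
cfg.optimizer_config.grad_clip = dict(max_norm=35, norm_type=2)  # clip exploding gradients (currently grad_clip=None)
cfg.lr_config.warmup_iters = 1000                                # a longer warmup can also tame early instability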

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
Guemann-ui commented, Feb 4, 2022

@MaxVanDijck from 0.12 to 0.00012
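If those numbers refer to the learning rate: the lr=0.02 in the config above is MMDetection's default for 8 GPUs with 2 images each, so a rough linear-scaling-rule estimate for the single-GPU, samples_per_gpu=4 setup in this issue would be:

base_lr, base_batch = 0.02, 8 * 2   # default batch assumed by most MMDetection configs
my_batch = 1 * 4                    # 1 GPU, samples_per_gpu=4
print(base_lr * my_batch / base_batch)   # 0.005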

1 reaction
liangular commented, Feb 3, 2022

I’ve added this line to the config, but unfortunately still the same result. I’ve tried it with the mini coco128 dataset downloaded from the YOLOX documentation and it seems to work fine without any NaNs. So this really does seem to be dataset-dependent, even though I don’t know what causes it. But since it apparently has nothing to do with the DetectoRS implementation, I’ll close this issue for now. If I find out why this happens, I will comment again.
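Since the problem appears to be dataset-dependent, a quick sanity scan of the COCO-style annotation file can rule out degenerate, non-finite or out-of-image boxes; the annotation path below is a placeholder:

import json, math

with open('annotations/instances_train.json') as f:   # placeholder path
    coco = json.load(f)

sizes = {img['id']: (img['width'], img['height']) for img in coco['images']}
for ann in coco['annotations']:
    x, y, w, h = ann['bbox']
    img_w, img_h = sizes[ann['image_id']]
    if w <= 0 or h <= 0 or not all(map(math.isfinite, (x, y, w, h))):
        print('degenerate box:', ann.get('id'), ann['bbox'])
    elif x < 0 or y < 0 or x + w > img_w or y + h > img_h:
        print('box outside image:', ann.get('id'), ann['bbox'])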

Read more comments on GitHub >

Top Results From Across the Web

Deep-Learning Nan loss reasons - python - Stack Overflow
Deep-Learning Nan loss reasons · It's a natural property of stochastic gradient descent, if the learning rate is too large, SGD can diverge...
Common Causes of NANs During Training
Common Causes of NANs During Training · Gradient blow up · Bad learning rate policy and params · Faulty Loss function · Faulty...
SSD Object Detector training results in NaN loss and RMSE
When using more anchor boxes with bigger sizes it trains longer without resulting in NaN. But at somepoint NaN will still show up....
Tensorflow Object Detection: Working Around a NAN Loss
Recently I was working on a project that required training of an object detection model in Tensorflow 2.x (version 2.4 to be specific)....
Keras Loss Functions: Everything You Need to Know
Why Keras loss nan happens ... Most of the time, losses you log will be just some regular values, but sometimes you might...
