Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

FloatingPointError: Loss became infinite or NaN at iteration=27!

See original GitHub issue

When I trained on the coco training set, I got the following error:

FloatingPointError: Loss became infinite or NaN at iteration=27! loss_dict = {'loss_cls': tensor(nan, device='cuda:1', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(0.2576, device='cuda:1', grad_fn=<DivBackward0>)}
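
For context, this error is raised by detectron2's training loop when the total loss stops being finite. The snippet below is a minimal sketch of that kind of check, not the library's exact code; check_losses_finite is a hypothetical helper used only to reproduce the message format seen above.

import torch

def check_losses_finite(loss_dict, iteration):
    # Sum the individual loss terms and verify the result is finite;
    # detectron2's trainer performs an equivalent check each iteration.
    losses = sum(loss_dict.values())
    if not torch.isfinite(losses).all():
        raise FloatingPointError(
            "Loss became infinite or NaN at iteration={}! "
            "loss_dict = {}".format(iteration, loss_dict)
        )

# Example: a NaN classification loss triggers the same error message.
try:
    check_losses_finite(
        {"loss_cls": torch.tensor(float("nan")),
         "loss_box_reg": torch.tensor(0.2576)},
        iteration=27,
    )
except FloatingPointError as e:
    print(e)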

How To Reproduce the Issue

python tools/train_net.py --num-gpus 8 --config-file configs/COCO-Detection/fast_rcnn_R_50_FPN_1x.yaml

Expected behavior

Training without error.

Environment

Please paste the output of python -m detectron2.utils.collect_env.


  • sys.platform: linux
  • Python: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
  • Numpy: 1.17.4
  • Detectron2 Compiler: GCC 7.3
  • Detectron2 CUDA Compiler: 10.1
  • DETECTRON2_ENV_MODULE: <not set>
  • PyTorch: 1.3.1
  • PyTorch Debug Build: False
  • torchvision: 0.4.2
  • CUDA available: True
  • GPU 0,1,2,3,4,5,6,7: Tesla P40
  • CUDA_HOME: /usr/local/cuda
  • NVCC: Cuda compilation tools, release 10.1, V10.1.168
  • Pillow: 6.2.1
  • cv2: 4.1.2


PyTorch built with:

  • GCC 7.3
  • Intel® Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel® 64 architecture applications
  • Intel® MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.3
  • Magma 2.5.1
  • Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7

Top GitHub Comments

2 reactions
ChongjianGE commented, Dec 6, 2019

@ppwwyyxx @dulucas I have solved the issue. There was something wrong with my GPU driver; after updating the corresponding drivers, everything works well. I will close the issue. Thank you for your help.

1 reaction
PeterZainzinger commented, Jan 12, 2020

I got the same error because I forgot to call resume_or_load.
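
For reference, here is a minimal sketch of the standard DefaultTrainer workflow this comment refers to, using the same config file as the reproduce command above; the config path and resume flag are illustrative, but resume_or_load is the detectron2 call being described.

from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
# Same config file as in the reproduce command (path relative to the detectron2 repo root).
cfg.merge_from_file("configs/COCO-Detection/fast_rcnn_R_50_FPN_1x.yaml")

trainer = DefaultTrainer(cfg)
# The step that was forgotten: load cfg.MODEL.WEIGHTS (or, with resume=True,
# the last checkpoint in cfg.OUTPUT_DIR) before training starts. Training from
# uninitialized weights is one way the loss can blow up to inf/NaN.
trainer.resume_or_load(resume=False)
trainer.train()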


Read more comments on GitHub >

Top Results From Across the Web

File "/home/jake/detectron2/detectron2/engine/train_loop.py ...
FloatingPointError : Loss became infinite or NaN at iteration=1! loss_dict = {'loss_cls_stage0': 1.613979458808899, 'loss_box_reg_stage0': ...
Read more >
Error report: FloatingPointError: Loss became infinite or NaN at ...
Error report: FloatingPointError: Loss became infinite or NaN at iteration=1099! FloatingPointError: Loss became infinite or NaN at iteration=1099!
Read more >
Loss became infinite or NaN at iteration = 1099! - 文章整合
Error report: FloatingPointError: Loss became infinite or NaN at iter.
Read more >
程序员资料 - FloatingPointError: loss became infinite or NaN at
Error report: FloatingPointError: Loss became infinite or NaN at iteration=1099! [04/01 15:05:09] d2.engine.train_loop ERROR: Exception during training: Traceback (most ...
Read more >
