Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

FloatingPointError: Loss became infinite or NaN at iteration=27!

See original GitHub issue

When I trained on the coco training set, I got the following error:

FloatingPointError: Loss became infinite or NaN at iteration=27! loss_dict = {'loss_cls': tensor(nan, device='cuda:1', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(0.2576, device='cuda:1', grad_fn=<DivBackward0>)}
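
For context, this error is raised by detectron2's training loop when the total loss stops being finite. The snippet below is a minimal sketch of that kind of check, not the library's exact code; check_losses_finite is a hypothetical helper used only to reproduce the message format seen above.

import torch

def check_losses_finite(loss_dict, iteration):
    # Sum the individual loss terms and verify the result is finite;
    # detectron2's trainer performs an equivalent check each iteration.
    losses = sum(loss_dict.values())
    if not torch.isfinite(losses).all():
        raise FloatingPointError(
            "Loss became infinite or NaN at iteration={}! "
            "loss_dict = {}".format(iteration, loss_dict)
        )

# Example: a NaN classification loss triggers the same error message.
try:
    check_losses_finite(
        {"loss_cls": torch.tensor(float("nan")),
         "loss_box_reg": torch.tensor(0.2576)},
        iteration=27,
    )
except FloatingPointError as e:
    print(e)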

How To Reproduce the Issue

python tools/train_net.py --num-gpus 8 --config-file configs/COCO-Detection/fast_rcnn_R_50_FPN_1x.yaml

Expected behavior

Training without error.

Environment

Please paste the output of python -m detectron2.utils.collect_env.


  • sys.platform: linux
  • Python: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
  • Numpy: 1.17.4
  • Detectron2 Compiler: GCC 7.3
  • Detectron2 CUDA Compiler: 10.1
  • DETECTRON2_ENV_MODULE: <not set>
  • PyTorch: 1.3.1
  • PyTorch Debug Build: False
  • torchvision: 0.4.2
  • CUDA available: True
  • GPU 0,1,2,3,4,5,6,7: Tesla P40
  • CUDA_HOME: /usr/local/cuda
  • NVCC: Cuda compilation tools, release 10.1, V10.1.168
  • Pillow: 6.2.1
  • cv2: 4.1.2


PyTorch built with:

  • GCC 7.3
  • Intel® Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel® 64 architecture applications
  • Intel® MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.3
  • Magma 2.5.1
  • Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7

Top GitHub Comments

2 reactions
ChongjianGE commented, Dec 6, 2019

@ppwwyyxx @dulucas I have solved the issue. There was something wrong with my GPU driver; after updating the corresponding drivers, everything works well. I will close the issue. Thank you for your help.

1 reaction
PeterZainzinger commented, Jan 12, 2020

I got the same error because I forgot to call resume_or_load.
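
For reference, here is a minimal sketch of the standard DefaultTrainer workflow this comment refers to, using the same config file as the reproduce command above; the config path and resume flag are illustrative, but resume_or_load is the detectron2 call being described.

from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
# Same config file as in the reproduce command (path relative to the detectron2 repo root).
cfg.merge_from_file("configs/COCO-Detection/fast_rcnn_R_50_FPN_1x.yaml")

trainer = DefaultTrainer(cfg)
# The step that was forgotten: load cfg.MODEL.WEIGHTS (or, with resume=True,
# the last checkpoint in cfg.OUTPUT_DIR) before training starts. Training from
# uninitialized weights is one way the loss can blow up to inf/NaN.
trainer.resume_or_load(resume=False)
trainer.train()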


Read more comments on GitHub >

Top Results From Across the Web

File "/home/jake/detectron2/detectron2/engine/train_loop.py ...
FloatingPointError : Loss became infinite or NaN at iteration=1! loss_dict = {'loss_cls_stage0': 1.613979458808899, 'loss_box_reg_stage0': ...
Read more >
Error report: FloatingPointError: Loss became infinite or NaN at ...
Error report: FloatingPointError: Loss became infinite or NaN at iteration=1099! FloatingPointError: Loss became infinite or NaN at iteration=1099!
Read more >
Loss became infinite or NaN at iteration = 1099! - 文章整合
Error report: FloatingPointError: Loss became infinite or NaN at iter.
Read more >
程序员资料 - FloatingPointError: loss became infinite or NaN at
Error report: FloatingPointError: Loss became infinite or NaN at iteration=1099! [04/01 15:05:09] d2.engine.train_loop ERROR: Exception during training: Traceback (most ...
Read more >
