FloatingPointError: Loss became infinite or NaN at iteration=27!
When I trained on the COCO training set, I got the following error:
FloatingPointError: Loss became infinite or NaN at iteration=27! loss_dict = {'loss_cls': tensor(nan, device='cuda:1', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(0.2576, device='cuda:1', grad_fn=<DivBackward0>)}
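For context, Detectron2's training loop raises this error as soon as the summed losses stop being finite. A minimal sketch of that kind of check (not the exact library code) might look like:

```python
import torch

def assert_losses_finite(loss_dict, iteration):
    # Sum the per-head losses returned by the model (e.g. loss_cls, loss_box_reg)
    losses = sum(loss_dict.values())
    # Abort as soon as any term goes to inf/NaN, as in the traceback above
    if not torch.isfinite(losses).all():
        raise FloatingPointError(
            f"Loss became infinite or NaN at iteration={iteration}! "
            f"loss_dict = {loss_dict}"
        )
```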
How To Reproduce the Issue
python tools/train_net.py --num-gpus 8 --config-file configs/COCO-Detection/fast_rcnn_R_50_FPN_1x.yaml
Expected behavior
Training without error.
Environment
Output of python -m detectron2.utils.collect_env:
- sys.platform: linux
- Python: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
- Numpy: 1.17.4
- Detectron2 Compiler: GCC 7.3
- Detectron2 CUDA Compiler: 10.1
- DETECTRON2_ENV_MODULE: <not set>
- PyTorch: 1.3.1
- PyTorch Debug Build: False
- torchvision: 0.4.2
- CUDA available: True
- GPU 0,1,2,3,4,5,6,7: Tesla P40
- CUDA_HOME: /usr/local/cuda
- NVCC: Cuda compilation tools, release 10.1, V10.1.168
- Pillow: 6.2.1
- cv2: 4.1.2
PyTorch built with:
- GCC 7.3
- Intel® Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel® 64 architecture applications
- Intel® MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.1
- Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
Top GitHub Comments
@ppwwyyxx @dulucas I have solved the issue. There was something wrong with my GPU driver. After updating the corresponding drivers, everything works well. I will close the issue. Thank you for your help.
I got the same error because I forgot to call resume_or_load.
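For anyone hitting this second cause: in the standard DefaultTrainer workflow (roughly what tools/train_net.py does), resume_or_load is the call that loads cfg.MODEL.WEIGHTS, so skipping it leaves the model randomly initialized and the losses can blow up to NaN early on. A minimal sketch, assuming the config from the reproduction command above:

```python
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file("configs/COCO-Detection/fast_rcnn_R_50_FPN_1x.yaml")

trainer = DefaultTrainer(cfg)
# Loads cfg.MODEL.WEIGHTS (or the last checkpoint when resume=True);
# forgetting this call leaves the model randomly initialized.
trainer.resume_or_load(resume=False)
trainer.train()
```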