
Training MVXNet fails on a custom dataset after a few epochs: NaN in loss, then CUDA error.

See original GitHub issue

First, after a few epochs, the losses became NaN:

2021-04-24 14:46:51,719 - mmdet - INFO - Epoch [6][3950/8195]   lr: 2.886e-03, eta: 2 days, 12:01:48, time: 0.764, data_time: 0.006, memory: 4553, loss_cls: 0.3132, loss_bbox: 0.8662, loss_dir: 0.0910, loss: 1.2704, grad_norm: 2.3257
2021-04-24 14:47:31,521 - mmdet - INFO - Epoch [6][4000/8195]   lr: 2.886e-03, eta: 2 days, 12:01:20, time: 0.796, data_time: 0.006, memory: 4553, loss_cls: 0.3746, loss_bbox: 0.9762, loss_dir: 0.0898, loss: 1.4406, grad_norm: 2.3565
2021-04-24 14:48:11,032 - mmdet - INFO - Epoch [6][4050/8195]   lr: 2.886e-03, eta: 2 days, 12:00:50, time: 0.790, data_time: 0.006, memory: 4553, loss_cls: 0.3404, loss_bbox: 0.8319, loss_dir: 0.0908, loss: 1.2632, grad_norm: 2.4758
2021-04-24 14:48:50,557 - mmdet - INFO - Epoch [6][4100/8195]   lr: 2.886e-03, eta: 2 days, 12:00:20, time: 0.790, data_time: 0.006, memory: 4553, loss_cls: 0.3433, loss_bbox: 0.8730, loss_dir: 0.0953, loss: 1.3115, grad_norm: 2.6249
2021-04-24 14:49:29,933 - mmdet - INFO - Epoch [6][4150/8195]   lr: 2.886e-03, eta: 2 days, 11:59:49, time: 0.788, data_time: 0.006, memory: 4553, loss_cls: 0.3695, loss_bbox: 0.8402, loss_dir: 0.0922, loss: 1.3018, grad_norm: 2.6444
2021-04-24 14:50:09,230 - mmdet - INFO - Epoch [6][4200/8195]   lr: 2.886e-03, eta: 2 days, 11:59:18, time: 0.786, data_time: 0.006, memory: 4553, loss_cls: 0.3588, loss_bbox: 0.8540, loss_dir: 0.0885, loss: 1.3013, grad_norm: 2.4400
2021-04-24 14:50:48,505 - mmdet - INFO - Epoch [6][4250/8195]   lr: 2.886e-03, eta: 2 days, 11:58:46, time: 0.785, data_time: 0.006, memory: 4553, loss_cls: 0.3558, loss_bbox: 0.8192, loss_dir: 0.0881, loss: 1.2631, grad_norm: 2.4894
2021-04-24 14:51:27,110 - mmdet - INFO - Epoch [6][4300/8195]   lr: 2.886e-03, eta: 2 days, 11:58:10, time: 0.772, data_time: 0.006, memory: 4553, loss_cls: 13.6469, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2021-04-24 14:52:06,332 - mmdet - INFO - Epoch [6][4350/8195]   lr: 2.886e-03, eta: 2 days, 11:57:39, time: 0.784, data_time: 0.006, memory: 4553, loss_cls: 21.8341, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2021-04-24 14:52:43,261 - mmdet - INFO - Epoch [6][4400/8195]   lr: 2.886e-03, eta: 2 days, 11:56:52, time: 0.739, data_time: 0.006, memory: 4553, loss_cls: 21.8341, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2021-04-24 14:53:22,458 - mmdet - INFO - Epoch [6][4450/8195]   lr: 2.886e-03, eta: 2 days, 11:56:20, time: 0.784, data_time: 0.006, memory: 4553, loss_cls: 21.8341, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2021-04-24 14:54:00,749 - mmdet - INFO - Epoch [6][4500/8195]   lr: 2.886e-03, eta: 2 days, 11:55:43, time: 0.766, data_time: 0.006, memory: 4553, loss_cls: 21.8341, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2021-04-24 14:54:38,475 - mmdet - INFO - Epoch [6][4550/8195]   lr: 2.886e-03, eta: 2 days, 11:55:01, time: 0.755, data_time: 0.006, memory: 4553, loss_cls: 21.8341, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2021-04-24 14:55:17,240 - mmdet - INFO - Epoch [6][4600/8195]   lr: 2.886e-03, eta: 2 days, 11:54:27, time: 0.775, data_time: 0.006, memory: 4553, loss_cls: 21.8341, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2021-04-24 14:55:55,833 - mmdet - INFO - Epoch [6][4650/8195]   lr: 2.886e-03, eta: 2 days, 11:53:51, time: 0.772, data_time: 0.006, memory: 4553, loss_cls: 21.8341, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan

Then, after a few more epochs, I got a CUDA error:

2021-04-25 06:20:34,017 - mmdet - INFO - Epoch [15][1150/8195]  lr: 2.181e-03, eta: 1 day, 20:29:04, time: 0.739, data_time: 0.006, memory: 4553, loss_cls: 21.8341, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2021-04-25 06:21:10,596 - mmdet - INFO - Epoch [15][1200/8195]  lr: 2.181e-03, eta: 1 day, 20:28:24, time: 0.732, data_time: 0.006, memory: 4553, loss_cls: 21.8341, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2021-04-25 06:21:47,305 - mmdet - INFO - Epoch [15][1250/8195]  lr: 2.181e-03, eta: 1 day, 20:27:44, time: 0.734, data_time: 0.006, memory: 4553, loss_cls: 21.8341, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
Traceback (most recent call last):
  File "tools/train.py", line 212, in <module>
    main()
  File "tools/train.py", line 208, in main
    meta=meta)
  File "/home/kirilly/mvx_env/lib/python3.6/site-packages/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/kirilly/mvx_env/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/kirilly/mvx_env/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/kirilly/mvx_env/lib/python3.6/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/kirilly/mvx_env/lib/python3.6/site-packages/mmcv/runner/hooks/optimizer.py", line 27, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/kirilly/mvx_env/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/kirilly/mvx_env/lib/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/kirilly/mvx_env/lib/python3.6/weakref.py", line 624, in _exitfunc
    f()
  File "/home/kirilly/mvx_env/lib/python3.6/weakref.py", line 548, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/home/kirilly/mvx_env/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 1046, in core
    dealloc.add_item(module_unload, handle)
  File "/home/kirilly/mvx_env/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 599, in add_item
    self.clear()
  File "/home/kirilly/mvx_env/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 610, in clear
    dtor(handle)
  File "/home/kirilly/mvx_env/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/kirilly/mvx_env/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemFree results in UNKNOWN_CUDA_ERROR
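
Presumably the illegal memory access is a follow-on effect of the NaN values rather than a separate bug. As a stop-gap, a hook along the lines of the sketch below (hypothetical name, standard mmcv 1.x Hook/registry API) could abort the run at the first non-finite loss instead of many epochs later:

# Rough sketch: abort training as soon as the logged loss stops being finite.
import math

from mmcv.runner import HOOKS, Hook


@HOOKS.register_module()
class StopOnNonFiniteLossHook(Hook):

    def after_train_iter(self, runner):
        loss = runner.outputs['loss'].item()
        if not math.isfinite(loss):
            runner.logger.error(
                f'Non-finite loss ({loss}) at iteration {runner.iter}; stopping.')
            raise RuntimeError('loss became NaN/inf')

It could then be enabled via custom_hooks in the config (custom_hooks = [dict(type='StopOnNonFiniteLossHook')]), assuming the training entry point registers hooks from cfg.custom_hooks; otherwise the same check can be added directly to the training script.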

Environment:

sys.platform: linux
Python: 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
CUDA available: True
GPU 0,1,2,3: GeForce GTX TITAN X
CUDA_HOME: /usr/local/cuda-10.2
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.5.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.6.0
OpenCV: 4.5.1
MMCV: 1.3.1
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.2
MMDetection: 2.11.0
MMDetection3D: 0.12.0+f7356f4

It is strange that CUDA is not detected by the script, but the training works fine.

Can you please tell me what the issue could be? Where would be the best place to look?

Thank you for the help!!!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

4 reactions
Yaziwel commented, Aug 4, 2021

Setting a lower learning rate could solve this problem. I set lr=0.0001, which works fine.
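
For example, in an MMDetection3D-style config the relevant overrides could look roughly like the sketch below (the optimizer type shown is an assumption; keep whatever type and fields your base config actually defines):

# Sketch of config overrides; everything else stays as in the base config.
optimizer = dict(type='AdamW', lr=1e-4, weight_decay=0.01)

# Gradient clipping (probably already present, since grad_norm shows up in the
# training logs) caps the damage a single bad batch can do.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))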

1 reaction
lysukhin commented, Sep 13, 2021

I faced a similar error while training MVX-Net on KITTI. I used the provided config file, but with samples_per_gpu = 1 on 4 GPUs (having only 11 GB of VRAM per GPU). Lowering the LR to 3e-4 seemed to stabilize training, as sketched below.
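
In config terms, that combination would look roughly like this (a sketch, not the exact file; field names follow the standard MMDetection3D config layout, and the intuition is the linear scaling rule: a smaller effective batch size, samples_per_gpu × num_gpus, calls for a proportionally smaller learning rate):

# Sketch of the two overrides discussed above; the rest of the provided
# KITTI config is left unchanged.
data = dict(samples_per_gpu=1)            # fits into 11 GB of VRAM per GPU
optimizer = dict(type='AdamW', lr=3e-4)   # learning rate scaled down to stabilize training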

Read more comments on GitHub >

