GPU usage keeps increasing until OOM error on iSAID dataset.
If you do not know the root cause of the problem / bug, and wish someone to help you, please include:
How To Reproduce the Issue
Run a simple training with any detectron2 backbone on the iSAID dataset https://captain-whu.github.io/iSAID/. iSAID is an instance segmentation dataset with COCO-style JSON annotations, using 15 object categories, and some images contain a very large number of instances (e.g. cars). iSAID is preprocessed by the authors' script, which converts labels, bounding boxes, and metadata to the COCO format while creating 800x800 patches of the high-resolution original images.
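For orientation, here is a rough sketch of the image-side patching idea only. This is not the authors' preprocessing script, which also splits the COCO annotations per patch and may use overlapping windows; the paths and file extension are assumptions:

from pathlib import Path
from PIL import Image

PATCH = 800  # patch size used for the iSAID crops

def make_patches(src_dir: str, dst_dir: str) -> None:
    """Tile each high-resolution image into non-overlapping 800x800 patches."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for img_path in Path(src_dir).glob("*.png"):
        img = Image.open(img_path)
        w, h = img.size
        for top in range(0, h, PATCH):
            for left in range(0, w, PATCH):
                box = (left, top, min(left + PATCH, w), min(top + PATCH, h))
                img.crop(box).save(Path(dst_dir) / f"{img_path.stem}_{top}_{left}.png")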
- what changes you made (git diff) or what code you wrote: I used the simple detectron2 Colab tutorial code, with the register_coco_instances function instead of a custom dataset function, as iSAID is fully compatible with the COCO format. Here is a link to the code for reproducing the error: https://drive.google.com/open?id=1bo0GOhHLlvEyc6E9DOZzlszg59THOT9x (a minimal sketch of this setup is shown after this list)
- what exact command you run: python3 training_naive.py, which calls register_coco_instances, sets up a cfg, and then runs a simple DefaultTrainer.train()
- what you observed (including the full logs): GPU memory usage keeps increasing after several iterations, until the run crashes with an out-of-memory error. Using torch.cuda.empty_cache() or the suggested
cfg.MODEL.RPN.PRE_NMS_TOPK_TRAIN = 200
cfg.MODEL.RPN.POST_NMS_TOPK_TRAIN = 200
did not solve it either.
Here is the link to the full output from bash: https://drive.google.com/open?id=1SszOAY9pEBFSsfp7nyc0Gv_mcCiHoAKo
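For reference, a minimal sketch of the setup described above, following the Colab tutorial pattern; the dataset paths, config file, and solver values are assumptions, not the exact contents of training_naive.py:

import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the preprocessed iSAID patches (COCO-style json); paths are placeholders
register_coco_instances("isaid_train", {},
                        "datasets/isaid/annotations/instances_train.json",
                        "datasets/isaid/train/images")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("isaid_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 15  # iSAID has 15 object categories
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 10000

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()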
Expected behavior
If there are no obvious errors in "what you observed" provided above, please tell us the expected behavior.
If you expect the model to work better, note that we do not help you train your model. We will only help with it in one of two cases: (1) you're unable to reproduce the results in the detectron2 model zoo, or (2) it indicates a detectron2 bug.
Environment
Please paste the output of python -m detectron2.utils.collect_env.
If detectron2 hasn't been successfully installed, use python detectron2/utils/collect_env.py.
(pytorch) paolo@ALCOR-TITANV-WS:~/libriaries/prove_detectron2$ python -m detectron2.utils.collect_env
sys.platform              linux
Python                    3.6.8 (default, Oct 9 2019, 14:04:01) [GCC 5.4.0 20160609]
Numpy                     1.17.4
Detectron2 Compiler       GCC 5.4
Detectron2 CUDA Compiler  10.1
DETECTRON2_ENV_MODULE     <not set>
PyTorch                   1.3.1
PyTorch Debug Build       False
torchvision               0.4.2
CUDA available            True
GPU 0,1,2,3               TITAN V
CUDA_HOME                 /usr/local/cuda-10.1
NVCC                      Cuda compilation tools, release 10.1, V10.1.105
Pillow                    6.2.1
cv2                       4.1.2
PyTorch built with:
- GCC 7.3
- Intel® Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel® 64 architecture applications
- Intel® MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.1
- Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
(pytorch) paolo@ALCOR-TITANV-WS:~/libriaries/prove_detectron2$
Top GitHub Comments
Update: the issue disappears if I remove the 100 images with the highest number of instances. The top 100 images have 700 instances per image, with the top 10 images having 3000 instances. Still, the official PANet implementation, which is heavily based on Detectron 1 code, is able to run on the whole dataset without any issue.
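If someone wants to try the same workaround programmatically, one possibility is to register a filtered copy of the dataset that drops the most crowded images; the dataset name, paths, and the 700-instance threshold below are assumptions, not part of the original comment:

from detectron2.data import DatasetCatalog
from detectron2.data.datasets import load_coco_json

MAX_INSTANCES = 700  # hypothetical cutoff; tune for your GPU memory

def isaid_train_filtered():
    # Load the COCO-style dicts, then drop images with too many annotations
    dicts = load_coco_json("datasets/isaid/annotations/instances_train.json",
                           "datasets/isaid/train/images",
                           dataset_name="isaid_train_filtered")
    return [d for d in dicts if len(d["annotations"]) <= MAX_INSTANCES]

DatasetCatalog.register("isaid_train_filtered", isaid_train_filtered)
# Then train with cfg.DATASETS.TRAIN = ("isaid_train_filtered",)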
Similar problem while using Colab and a custom dataset. Solved by tinkering with cfg settings.
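The comment does not say which settings were changed; the values below are only illustrative examples of cfg knobs that commonly lower GPU memory use during training:

from detectron2.config import get_cfg

cfg = get_cfg()
# Illustrative memory-saving knobs (trade accuracy/speed for lower GPU memory)
cfg.SOLVER.IMS_PER_BATCH = 1                    # fewer images per training batch
cfg.INPUT.MIN_SIZE_TRAIN = (640,)               # train on smaller resized images
cfg.MODEL.RPN.PRE_NMS_TOPK_TRAIN = 1000         # fewer RPN proposals kept before NMS
cfg.MODEL.RPN.POST_NMS_TOPK_TRAIN = 500         # fewer RPN proposals kept after NMS
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128  # fewer RoIs sampled per image
cfg.DATALOADER.NUM_WORKERS = 2                  # affects CPU RAM, not GPU memory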