Memory Leak when single GPU Test on 2.7k+ images
Describe the bug
Testing a pre-trained Cityscapes model with a single GPU on the training images leads to exhaustion of RAM.
Reproduction
- What command or script did you run?
  python tools/test.py configs/fcn/fcn_r50-d8_512x1024_80k_cityscapes.py {checkpoint} --data_path {path of the data} --eval mIoU (without using the distributed data parallel arguments)
- Did you make any modifications on the code or config? Did you understand what you have modified?
  Added a data_path argument to pass the path to the Cityscapes dataset, and changed the test paths to the train paths so the model is evaluated on the training images.
- What dataset did you use?
  Cityscapes; the same behaviour is also observed with a custom dataset.
Environment
- Please run python mmseg/utils/collect_env.py to collect the necessary environment information and paste it here.

sys.platform: linux
Python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
CUDA available: False
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
PyTorch: 1.3.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- Intel® Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel® 64 architecture applications
- Intel® MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.4.1a0+d94043a
OpenCV: 4.4.0
MMCV: 1.2.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMSegmentation: 0.8.0+0d10921
- You may add additional information that may be helpful for locating the problem, such as:
- I tried switching to PyTorch 1.6.0 and changed the corresponding mmcv version, but the problem still persists. I also tried with the latest master; it is still the same.
- I have tried using memory_profiler to locate the memory leak, but this did not help (see the sketch after this list).
- I tried setting num_workers to 0 and also LRU_CACHE_CAPACITY=1 to avoid excessive memory usage.
- I also observed memory exhaustion while training the model on Cityscapes and my custom dataset. For example, 60 GB of RAM is exhausted after 20k epochs on Cityscapes.
- Testing the model on the Cityscapes validation set also leads to a continuous increase in memory usage, but since there are only 500 val images and I have 60 GB of RAM allocated, it does not crash.
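For reference, here is a minimal, hypothetical way to confirm the leak by logging resident memory per batch with psutil; the loop placement in the comments is an assumption about tools/test.py, not its actual code.

```python
import os
import psutil

proc = psutil.Process(os.getpid())

def log_rss(tag):
    # Resident set size in MB, so growth across iterations is visible.
    rss_mb = proc.memory_info().rss / 1024 ** 2
    print(f'[{tag}] RSS: {rss_mb:.1f} MB')

# Hypothetical placement inside the test loop:
# for i, data in enumerate(data_loader):
#     result = model(return_loss=False, **data)
#     results.extend(result)
#     if i % 100 == 0:
#         log_rss(f'batch {i}')
```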
Error traceback
I'm running my code on a single node on a headless SLURM cluster, so I cannot perform any interactive debugging. I have not made any changes to the source code except those mentioned above. I have been trying to debug this for a week but still have no luck. Please let me know if you can find a solution to my problem.
A placeholder for traceback.
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!
The reason for the memory leak is https://github.com/open-mmlab/mmsegmentation/blob/e8cc3224e1b44ae44bf7f22f356f64059a8d82b9/mmseg/apis/test.py#L91: result is a list containing one prediction label map with np.int64 dtype. At https://github.com/open-mmlab/mmsegmentation/blob/e8cc3224e1b44ae44bf7f22f356f64059a8d82b9/mmseg/apis/test.py#L125-L131 it is appended to results in every loop iteration, which will cause a memory explosion when testing on a large dataset. To solve it, just add a line before Line 125:
result = [_.astype(np.uint16) for _ in result]
This reduces RAM usage by about 4x. If the num_classes of the dataset is less than 254, np.uint8 can be considered, which reduces RAM usage by about 8x. #189
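For illustration, a sketch of where that cast would sit in a simplified single-GPU test loop; the loop below is a simplification for clarity, not the exact mmseg/apis/test.py code.

```python
import numpy as np
import torch


def single_gpu_test_sketch(model, data_loader):
    """Simplified test loop showing the dtype workaround."""
    model.eval()
    results = []
    with torch.no_grad():
        for data in data_loader:
            # `result` is a list of per-image label maps, np.int64 by default.
            result = model(return_loss=False, **data)
            # Workaround: downcast before accumulating, cutting RAM roughly 4x
            # (np.uint8 cuts it roughly 8x if num_classes < 254).
            result = [_.astype(np.uint16) for _ in result]
            results.extend(result)
    return results
```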
Yep. We have supported memory-efficient test in #330.
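For context, the memory-efficient test keeps RAM flat by not holding every prediction array in memory. A rough sketch of that idea follows; the helper name and usage are illustrative assumptions, not the actual implementation in #330.

```python
import os
import tempfile

import numpy as np


def dump_prediction(pred, tmpdir):
    # Save the prediction to disk and return only the file path,
    # so memory use stays roughly constant regardless of dataset size.
    fd, path = tempfile.mkstemp(suffix='.npy', dir=tmpdir)
    os.close(fd)
    np.save(path, pred)
    return path

# Hypothetical usage inside the test loop:
# results = []
# for data in data_loader:
#     result = model(return_loss=False, **data)
#     results.extend(dump_prediction(r, tmpdir) for r in result)
# Evaluation then loads each .npy file back one at a time.
```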