
Memory leak when single-GPU testing on 2.7k+ images

See original GitHub issue

Describe the bug
Testing a pre-trained Cityscapes model on the training images with a single GPU leads to exhaustion of RAM.

Reproduction

  1. What command or script did you run?

     python tools/test.py configs/fcn/fcn_r50-d8_512x1024_80k_cityscapes.py {checkpoint} --data_path {path of the data} --eval mIoU

     (without using distributed data-parallel params)
  2. Did you make any modifications on the code or config? Did you understand what you have modified?

     Added a data_path argparse argument to pass in the path to the Cityscapes dataset, and changed the test split paths in the config to the train split paths so the model is evaluated on the training images.

  3. What dataset did you use?

     Cityscapes; the same behavior was also observed with a custom dataset.

Environment

  4. Please run python mmseg/utils/collect_env.py to collect the necessary environment information and paste it here.

     sys.platform: linux
     Python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
     CUDA available: False
     GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
     PyTorch: 1.3.0
     PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • Intel® Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel® 64 architecture applications
  • Intel® MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.4.1a0+d94043a
OpenCV: 4.4.0
MMCV: 1.2.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMSegmentation: 0.8.0+0d10921

  5. You may add additional information that may be helpful for locating the problem, such as:
    • I tried switching to PyTorch 1.6.0 with the corresponding MMCV version, but the problem persists. I also tried the latest master; it is still the same.
    • I tried using memory_profiler to locate the memory leak, but this did not help (a standalone profiling sketch is shown after this list).
    • I tried setting num_workers to 0 and also LRU_CACHE_CAPACITY=1 to avoid excessive memory usage.
    • I also observed memory exhaustion while training the model on Cityscapes and on my custom dataset; for example, 60 GB of RAM is exhausted after 20k epochs on Cityscapes.
    • Testing the model on the Cityscapes validation set also leads to a continuous increase in memory usage, but since there are only 500 validation images and 60 GB of RAM is allocated, this does not crash.
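
For anyone reproducing this, here is a minimal standalone sketch of watching resident memory grow while mimicking the accumulation pattern in mmseg/apis/test.py. The script name, the dummy predictions, and the psutil dependency are illustrative assumptions, not part of mmsegmentation:

    # watch_rss.py -- hypothetical sketch, not part of mmsegmentation.
    # Mimics the results-list accumulation in mmseg/apis/test.py with
    # dummy int64 label maps and prints resident memory as it grows.
    import os

    import numpy as np
    import psutil

    proc = psutil.Process(os.getpid())
    results = []
    for i in range(300):
        # Stand-in for one per-image prediction (np.int64, as in mmseg).
        pred = np.ones((512, 1024), dtype=np.int64)
        results.append(pred)
        if i % 50 == 0:
            rss_mib = proc.memory_info().rss / 2**20
            print(f"after {i} images: RSS = {rss_mib:.0f} MiB")
    # RSS should grow by roughly 4 MiB per image here; casting to a
    # smaller dtype before appending shrinks this 4-8x (see comments below).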

Error traceback

I’m running my code on a single node of a headless SLURM cluster, so I cannot perform any interactive debugging. I have not made any changes to the source code except those mentioned above. I have been trying to debug this for a week with no luck. Please let me know if you can find a solution to my problem.

A placeholder for the traceback.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 12

Top GitHub Comments

2 reactions
MELSunny commented on Mar 8, 2022

The reason for the memory leak: at https://github.com/open-mmlab/mmsegmentation/blob/e8cc3224e1b44ae44bf7f22f356f64059a8d82b9/mmseg/apis/test.py#L91, result is a list holding one prediction label map of type np.int64. At https://github.com/open-mmlab/mmsegmentation/blob/e8cc3224e1b44ae44bf7f22f356f64059a8d82b9/mmseg/apis/test.py#L125-L131 it is appended to results on every loop iteration, which causes a memory explosion when testing on a large dataset. To solve it, just add a line before Line 125:

    result = [_.astype(np.uint16) for _ in result]

This reduces RAM usage by 4x. If the dataset's num_classes is less than 254, np.uint8 can be considered instead, which reduces RAM usage by 8x. See #189.
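
In context, the suggested cast would sit just before the accumulation step of the single-GPU test loop. The sketch below wraps it in a small function for readability; the variable names are assumptions based on mmseg/apis/test.py at the commit linked above, not the actual diff:

    # Sketch of the dtype-cast fix applied just before accumulation,
    # as it would sit inside single_gpu_test in mmseg/apis/test.py.
    import numpy as np

    def accumulate(results, result):
        """results: the running list of predictions across the dataset;
        result: one batch's list of per-image label maps (np.int64)."""
        result = [seg.astype(np.uint16) for seg in result]  # ~4x smaller
        # If the dataset has fewer than 254 classes, uint8 gives ~8x:
        # result = [seg.astype(np.uint8) for seg in result]
        results.extend(result)  # the pre-existing accumulation step
        return results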

1 reaction
xvjiarui commented on Jan 13, 2021

Yep. We have supported memory-efficient test in #330.
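
For intuition, the core idea of a memory-efficient test is to write each prediction to disk and keep only file paths in RAM. The sketch below illustrates that pattern with a hypothetical run_inference() generator; it is not the actual implementation from #330:

    # Sketch of the streaming pattern behind memory-efficient testing:
    # save each prediction to a temp file so RAM holds only file paths.
    import os
    import tempfile

    import numpy as np

    def evaluate_streaming(run_inference):
        """run_inference: a hypothetical generator yielding one
        per-image label map (ndarray) at a time."""
        tmpdir = tempfile.mkdtemp()
        paths = []
        for i, pred in enumerate(run_inference()):
            path = os.path.join(tmpdir, f"{i}.npy")
            np.save(path, pred.astype(np.uint16))  # small dtype on disk too
            paths.append(path)                     # O(1) RAM per image
        # Metrics are then computed by reloading one prediction at a time:
        for path in paths:
            pred = np.load(path)
            # ... update IoU / confusion-matrix accumulators here ...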


