
Memory leak when single-GPU testing on 2.7k+ images

See original GitHub issue

Describe the bug
Testing a pre-trained Cityscapes model on the training images with a single GPU leads to exhaustion of RAM.

Reproduction

  1. What command or script did you run?

     python tools/test.py configs/fcn/fcn_r50-d8_512x1024_80k_cityscapes.py {checkpoint} --data_path {path of the data} --eval mIoU

     (without using distributed data-parallel params)
  2. Did you make any modifications on the code or config? Did you understand what you have modified?

     Added a data_path argparse argument to pass in the path to the Cityscapes dataset, and changed the test split paths in the config to the train split paths so the model is evaluated on the training images.

  3. What dataset did you use?

     Cityscapes; the same behavior was also observed with a custom dataset.

Environment

  4. Please run python mmseg/utils/collect_env.py to collect the necessary environment information and paste it here.

     sys.platform: linux
     Python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
     CUDA available: False
     GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
     PyTorch: 1.3.0
     PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • Intel® Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel® 64 architecture applications
  • Intel® MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.4.1a0+d94043a
OpenCV: 4.4.0
MMCV: 1.2.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMSegmentation: 0.8.0+0d10921

  5. You may add additional information that may be helpful for locating the problem, such as:
    • I tried switching to PyTorch 1.6.0 with the corresponding MMCV version, but the problem persists. I also tried the latest master; it is still the same.
    • I tried using memory_profiler to locate the memory leak, but this did not help (a standalone profiling sketch is shown after this list).
    • I tried setting num_workers to 0 and also LRU_CACHE_CAPACITY=1 to avoid excessive memory usage.
    • I also observed memory exhaustion while training the model on Cityscapes and on my custom dataset; for example, 60 GB of RAM is exhausted after 20k epochs on Cityscapes.
    • Testing the model on the Cityscapes validation set also leads to a continuous increase in memory usage, but since there are only 500 validation images and 60 GB of RAM is allocated, this does not crash.
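
For anyone reproducing this, here is a minimal standalone sketch of watching resident memory grow while mimicking the accumulation pattern in mmseg/apis/test.py. The script name, the dummy predictions, and the psutil dependency are illustrative assumptions, not part of mmsegmentation:

    # watch_rss.py -- hypothetical sketch, not part of mmsegmentation.
    # Mimics the results-list accumulation in mmseg/apis/test.py with
    # dummy int64 label maps and prints resident memory as it grows.
    import os

    import numpy as np
    import psutil

    proc = psutil.Process(os.getpid())
    results = []
    for i in range(300):
        # Stand-in for one per-image prediction (np.int64, as in mmseg).
        pred = np.ones((512, 1024), dtype=np.int64)
        results.append(pred)
        if i % 50 == 0:
            rss_mib = proc.memory_info().rss / 2**20
            print(f"after {i} images: RSS = {rss_mib:.0f} MiB")
    # RSS should grow by roughly 4 MiB per image here; casting to a
    # smaller dtype before appending shrinks this 4-8x (see comments below).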

Error traceback

I’m running my code on a single node of a headless SLURM cluster, so I cannot perform any interactive debugging. I have not made any changes to the source code except those mentioned above. I have been trying to debug this for a week with no luck. Please let me know if you can find a solution to my problem.

A placeholder for the traceback.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 12

Top GitHub Comments

2 reactions
MELSunny commented on Mar 8, 2022

The reason for the memory leak: at https://github.com/open-mmlab/mmsegmentation/blob/e8cc3224e1b44ae44bf7f22f356f64059a8d82b9/mmseg/apis/test.py#L91, result is a list holding one prediction label map of type np.int64. At https://github.com/open-mmlab/mmsegmentation/blob/e8cc3224e1b44ae44bf7f22f356f64059a8d82b9/mmseg/apis/test.py#L125-L131 it is appended to results on every loop iteration, which causes a memory explosion when testing on a large dataset. To solve it, just add a line before Line 125:

    result = [_.astype(np.uint16) for _ in result]

This reduces RAM usage by 4x. If the dataset's num_classes is less than 254, np.uint8 can be considered instead, which reduces RAM usage by 8x. See #189.
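
In context, the suggested cast would sit just before the accumulation step of the single-GPU test loop. The sketch below wraps it in a small function for readability; the variable names are assumptions based on mmseg/apis/test.py at the commit linked above, not the actual diff:

    # Sketch of the dtype-cast fix applied just before accumulation,
    # as it would sit inside single_gpu_test in mmseg/apis/test.py.
    import numpy as np

    def accumulate(results, result):
        """results: the running list of predictions across the dataset;
        result: one batch's list of per-image label maps (np.int64)."""
        result = [seg.astype(np.uint16) for seg in result]  # ~4x smaller
        # If the dataset has fewer than 254 classes, uint8 gives ~8x:
        # result = [seg.astype(np.uint8) for seg in result]
        results.extend(result)  # the pre-existing accumulation step
        return results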

1 reaction
xvjiarui commented on Jan 13, 2021

Yep. We have supported memory-efficient test in #330.
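
For intuition, the core idea of a memory-efficient test is to write each prediction to disk and keep only file paths in RAM. The sketch below illustrates that pattern with a hypothetical run_inference() generator; it is not the actual implementation from #330:

    # Sketch of the streaming pattern behind memory-efficient testing:
    # save each prediction to a temp file so RAM holds only file paths.
    import os
    import tempfile

    import numpy as np

    def evaluate_streaming(run_inference):
        """run_inference: a hypothetical generator yielding one
        per-image label map (ndarray) at a time."""
        tmpdir = tempfile.mkdtemp()
        paths = []
        for i, pred in enumerate(run_inference()):
            path = os.path.join(tmpdir, f"{i}.npy")
            np.save(path, pred.astype(np.uint16))  # small dtype on disk too
            paths.append(path)                     # O(1) RAM per image
        # Metrics are then computed by reloading one prediction at a time:
        for path in paths:
            pred = np.load(path)
            # ... update IoU / confusion-matrix accumulators here ...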


