CuPy memory leak? (Outstanding consumption.)

See original GitHub issue

Hello.

I am wondering how memory is managed by CuPy, and whether there could be an issue in the way it is handled. Let me explain:

I was curious about CuPy and decided to compare it against NumPy with the very simple code below. While it works like a charm with NumPy, i.e. on CPU and standard RAM, it crashes completely with CuPy: cupy.cuda.memory.OutOfMemoryError is thrown.

import numpy as np
import cupy as cp
import time

# Reminder: the ** operator is exponentiation, e.g. 10**9 = 1,000,000,000.
P = 9
N = 10**P

print(f"Scenario 1: Summing ranges of {N} random integers between 0 and 9.")
print("")

Lx = np.random.randint(0, 10, size=N)
print(Lx.nbytes/1024/1024)  # array size in MB

T0 = time.perf_counter()
np.sum(Lx)
T1 = time.perf_counter()

del Lx

print("CPU computation done in {:5,.0f} ms.".format((T1-T0)*1000))

Lx = cp.random.randint(0, 10, size=N)
print(Lx.nbytes/1024/1024)  # array size in MB

T0 = time.perf_counter()
cp.sum(Lx)  # note: launched asynchronously; see the timing sketch further below
T1 = time.perf_counter()

del Lx

print("GPU computation done in {:5,.0f} ms.".format((T1-T0)*1000))

This small code gives the following output:

Scenario 1: Summing ranges of 1000000000 random integers between 0 and 9.

3814.697265625
CPU computation done in   585 ms.
Traceback (most recent call last):
  File "C:\APPLIS\nvidia\samples\operations.py", line 23, in <module>
    Lx = cp.random.randint(0, 10, size=N)
  File "C:\APPLIS\python\lib\site-packages\cupy\random\_sample.py", line 110, in randint
    return rs.randint(low, high, size, dtype)
  File "C:\APPLIS\python\lib\site-packages\cupy\random\_generator.py", line 1161, in randint
    x = self._interval(diff, size).astype(dtype, copy=False)
  File "C:\APPLIS\python\lib\site-packages\cupy\random\_generator.py", line 709, in _interval
    ret[n - n_rem:n - n_rem + n_succ] = sample[success][:n_succ]
  File "cupy\core\core.pyx", line 1228, in cupy.core.core.ndarray.__getitem__
  File "cupy\core\_routines_indexing.pyx", line 39, in cupy.core._routines_indexing._ndarray_getitem
  File "cupy\core\_routines_indexing.pyx", line 669, in cupy.core._routines_indexing._getitem_mask_single
  File "cupy\core\_routines_indexing.pyx", line 636, in cupy.core._routines_indexing._prepare_mask_indexing_single
  File "cupy\core\_routines_math.pyx", line 420, in cupy.core._routines_math.scan
  File "cupy\core\core.pyx", line 2392, in cupy.core.core._ndarray_init
  File "cupy\core\core.pyx", line 151, in cupy.core.core.ndarray._init_fast
  File "cupy\cuda\memory.pyx", line 578, in cupy.cuda.memory.alloc
  File "cupy\cuda\memory.pyx", line 1250, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy\cuda\memory.pyx", line 1271, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy\cuda\memory.pyx", line 939, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy\cuda\memory.pyx", line 959, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy\cuda\memory.pyx", line 1210, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 4,000,000,000 bytes (allocated so far: 9,000,001,024 bytes).

What this means is that the program failed during cp.random.randint(0, 10, size=N) because it ran out of memory.

Re-running it while keeping an eye on CPU, RAM, GPU and GPU-memory consumption revealed something odd:

  • NumPy consumes ~4GB as stated; nothing less, nothing more
  • CuPy consumes ~4GB over the 4GB of dedicated GPU memory available, then starts consuming shared RAM up to 8GB, which ends up crashing the run since I have no more than 8GB of standard RAM free for anything

That means CuPy’s memory requirement for a simple randint is more than 3x NumPy’s.
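
For context when checking numbers like these: CuPy allocates through a memory pool by default, so the device-side footprint can also be inspected from Python rather than from an OS monitor. A minimal sketch, assuming the default pool is in use (the 10**8 size here is just an illustrative value that fits a 4GB card):

import cupy as cp

pool = cp.get_default_memory_pool()

Lx = cp.random.randint(0, 2, size=10**8)
print(pool.used_bytes())   # bytes held by live arrays
print(pool.total_bytes())  # bytes reserved by the pool, including cached blocks

del Lx
pool.free_all_blocks()     # hand cached blocks back to the driver
print(pool.total_bytes())  # should drop back toward zero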

Interestingly, I then restricted randint to cp.random.randint(0, 2, size=N) so that CuPy could at least achieve something.

This gave:

Scenario 1: Summing ranges of 1000000000 random integers between 0 and 1.

3814.697265625
CPU computation done in   577 ms.
3814.697265625
GPU computation done in   398 ms.

Something’s fishy. Any clue?
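
One side note on the timings above: CuPy launches kernels asynchronously, so stopping a host timer right after cp.sum mostly measures the kernel launch, not the reduction itself. A minimal sketch of a fairer measurement, synchronizing the device before reading the clock (the array size is illustrative):

import time
import cupy as cp

Lx = cp.random.randint(0, 2, size=10**8)

T0 = time.perf_counter()
cp.sum(Lx)
cp.cuda.Device().synchronize()  # block until the reduction kernel has finished
T1 = time.perf_counter()

print("GPU computation done in {:5,.0f} ms.".format((T1 - T0) * 1000))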

Configuration:

OS                           : Windows-10-10.0.17134-SP0
CuPy Version                 : 8.5.0
NumPy Version                : 1.20.1
SciPy Version                : None
Cython Build Version         : 0.29.22
CUDA Root                    : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2
CUDA Build Version           : 11020
CUDA Driver Version          : 11030
CUDA Runtime Version         : 11020
cuBLAS Version               : 11401
cuFFT Version                : 10400
cuRAND Version               : 10203
cuSOLVER Version             : (11, 1, 0)
cuSPARSE Version             : 11400
NVRTC Version                : (11, 2)
Thrust Version               : 101000
CUB Build Version            : 101000
cuDNN Build Version          : None
cuDNN Version                : None
NCCL Build Version           : None
NCCL Runtime Version         : None
cuTENSOR Version             : None
Device 0 Name                : Quadro P620
Device 0 Compute Capability  : 61

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

1 reaction
willemavjc commented, Mar 12, 2021

> Another thing I forgot to ask: you mentioned CuPy saturates 8GB of CPU memory; how did you monitor it? Almost all CuPy array creation routines create arrays directly on device (GPU) without going through host (CPU) at all, meaning there’s no device-host transfer, so perhaps this is where we are confused.

> I think I know what happened here: you have one of those mobile GPUs where, once they run out of dedicated memory, Windows starts giving them a portion of main memory to use. That’s why you see the host memory growing up to a certain point. In CuPy, randint runs a loop that performs multiple memory allocations (a sample & reject loop), so for a 4GB array just a few of these additional allocations can exhaust all the available memory. This is a CuPy issue that we know about, and you can work around it by using the new random API generator, as I suggested above.

I think this is what is happening.

How am I monitoring this: I open the Windows Task Manager and keep an eye on CPU (i5, 8 cores), RAM (16GB) and the GPU view; the latter is divided in two, one graph for the GPU itself and another for the dedicated GDDR memory (4GB).

When I start this small test, CPU is under 10% and RAM at 50% (8GB).

NumPy then starts: CPU grows a bit but not too much, and around 4GB of RAM is consumed (75% of the total now in use). NumPy finishes and RAM goes back down to 8GB (50%); del therefore worked as expected.

Then CuPy starts:

  • GPU usage grows, but not too much
  • GDDR rapidly increases to 4GB; no RAM is consumed (still at 50%: 8GB)
  • Once the GDDR is saturated, RAM starts growing, GB by GB, until RAM saturation (100%, i.e. +8GB, hence my statement about 12GB of total consumption)
  • Failure; everything is freed: GDDR (4GB) + RAM (8GB); we are back to normal with 50% of RAM available and GDDR 100% available (a programmatic cross-check of these readings is sketched below)
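
A minimal sketch for cross-checking those Task Manager readings from Python, via the CUDA runtime’s free/total memory query (cupy.cuda.runtime.memGetInfo):

import cupy as cp

free, total = cp.cuda.runtime.memGetInfo()  # bytes free / total on the current device
print(f"GPU memory: {free / 1024**2:,.0f} MB free of {total / 1024**2:,.0f} MB")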

> This is a CuPy issue that we know about, and you can work around it by using the new random API generator, as I suggested above.

I am eager to test it, but CuPy 9 is still in beta, right?

1 reaction
emcastillo commented, Mar 12, 2021

I think I know what happened here: you have one of those mobile GPUs where, once they run out of dedicated memory, Windows starts giving them a portion of main memory to use. That’s why you see the host memory growing up to a certain point. In CuPy, randint runs a loop that performs multiple memory allocations (a sample & reject loop), so for a 4GB array just a few of these additional allocations can exhaust all the available memory. This is a CuPy issue that we know about, and you can work around it by using the new random API generator, as I suggested above.
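
For reference, the suggested workaround with the new Generator-based random API (available from CuPy v9) would look roughly like the sketch below, assuming, per the comment above, that Generator.integers avoids the legacy sample & reject allocations; the size is scaled down here to fit a 4GB card:

import cupy as cp

rng = cp.random.default_rng()  # new-style Generator API (CuPy v9+)

# 10**8 int32 values ~= 400MB, comfortably within 4GB of GDDR.
Lx = rng.integers(0, 10, size=10**8, dtype=cp.int32)
print(cp.sum(Lx))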

Read more comments on GitHub >

Top Results From Across the Web

Memory Management — CuPy 11.4.0 documentation
CuPy uses memory pool for memory allocations by default. The memory pool significantly improves the performance by mitigating the overhead of memory allocation…
Read more >
Memory leaking when using GPU acceleration of filtering and ...
MNE-Python version: 0.23.0 - operating system: Ubuntu Hi, I'm using CUDA to accelerate the filtering. However, I found that the memory wouldn't release...
Read more >
CuPy Documentation - Read the Docs
by memory usage). If the size is set to 0, the cache is disabled. Please refer to its documentation for more detail.
Read more >
How to fully release GPU memory used in function
CuPy uses Python's reference counter to track which arrays are in use. In this case, you should del arr_gpu before calling free_all_blocks ...
Read more >
How to debug causes of GPU memory leaks? - PyTorch Forums
Unable to allocate cuda memory, when there is enough of cached memory. Phantom PyTorch Data on GPU. CPU memory usage leak because of...
Read more >
