CuPy memory leak? (Outstanding consumption.)
See original GitHub issue

Hello.
I am wondering how memory is managed by CuPy and whether there could be an issue in the way it is handled. Let me explain:
I was curious about CuPy and decided to compare it against NumPy via the very simple code below. While it works like a charm with NumPy - so on the CPU and standard RAM - it crashes outright with CuPy: a `cupy.cuda.memory.OutOfMemoryError` is thrown.
```python
import numpy as np
import cupy as cp
import time

# Reminder: the ** operator is exponentiation, i.e. 10**9 = 1,000,000,000.
P = 9
N = 10**P

print(f"Scenario 1: Summing ranges of {N} random integers between 0 and 9.")
print("")

Lx = np.random.randint(0, 10, size=N)
print(Lx.nbytes / 1024 / 1024)  # array size in MiB
T0 = time.perf_counter()
np.sum(Lx)
T1 = time.perf_counter()
del Lx
print("CPU computation done in {:5,.0f} ms.".format((T1 - T0) * 1000))

Lx = cp.random.randint(0, 10, size=N)
print(Lx.nbytes / 1024 / 1024)  # array size in MiB
T0 = time.perf_counter()
cp.sum(Lx)
cp.cuda.Stream.null.synchronize()  # kernel launches are asynchronous; wait before stopping the clock
T1 = time.perf_counter()
del Lx
print("GPU computation done in {:5,.0f} ms.".format((T1 - T0) * 1000))
```
This small code gives the following output:
```
Scenario 1: Summing ranges of 1000000000 random integers between 0 and 1.
3814.697265625
CPU computation done in 585 ms.
Traceback (most recent call last):
  File "C:\APPLIS\nvidia\samples\operations.py", line 23, in <module>
    Lx = cp.random.randint(0, 10, size=N)
  File "C:\APPLIS\python\lib\site-packages\cupy\random\_sample.py", line 110, in randint
    return rs.randint(low, high, size, dtype)
  File "C:\APPLIS\python\lib\site-packages\cupy\random\_generator.py", line 1161, in randint
    x = self._interval(diff, size).astype(dtype, copy=False)
  File "C:\APPLIS\python\lib\site-packages\cupy\random\_generator.py", line 709, in _interval
    ret[n - n_rem:n - n_rem + n_succ] = sample[success][:n_succ]
  File "cupy\core\core.pyx", line 1228, in cupy.core.core.ndarray.__getitem__
  File "cupy\core\_routines_indexing.pyx", line 39, in cupy.core._routines_indexing._ndarray_getitem
  File "cupy\core\_routines_indexing.pyx", line 669, in cupy.core._routines_indexing._getitem_mask_single
  File "cupy\core\_routines_indexing.pyx", line 636, in cupy.core._routines_indexing._prepare_mask_indexing_single
  File "cupy\core\_routines_math.pyx", line 420, in cupy.core._routines_math.scan
  File "cupy\core\core.pyx", line 2392, in cupy.core.core._ndarray_init
  File "cupy\core\core.pyx", line 151, in cupy.core.core.ndarray._init_fast
  File "cupy\cuda\memory.pyx", line 578, in cupy.cuda.memory.alloc
  File "cupy\cuda\memory.pyx", line 1250, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy\cuda\memory.pyx", line 1271, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy\cuda\memory.pyx", line 939, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy\cuda\memory.pyx", line 959, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy\cuda\memory.pyx", line 1210, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 4,000,000,000 bytes (allocated so far: 9,000,001,024 bytes).
```
What that means is that the program failed during `cp.random.randint(0, 10, size=N)` because it ran out of memory.
Re-running it while keeping an eye on CPU/RAM and GPU/VRAM consumption revealed something odd:
- NumPy consumes ~4GB as expected; nothing less, nothing more.
- CuPy consumes the full ~4GB of dedicated VRAM, then starts consuming shared RAM up to 8GB, which ends in a crash as I have no more than 8GB of standard RAM free for anything.
That means CuPy's memory requirement for a simple `randint` is more than 3× NumPy's.
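For scale, the footprint of the array alone can be checked by hand: on Windows, `randint` defaults to 32-bit integers, which matches the ~3814 MiB printed by the script, while the traceback shows ~9GB already held when a further 4GB allocation fails. A back-of-the-envelope sketch:

```python
N = 10**9      # number of elements
itemsize = 4   # bytes per element: randint defaults to 32-bit ints on Windows

array_bytes = N * itemsize
print(array_bytes)                    # 4,000,000,000 bytes for one array
print(array_bytes / 1024 / 1024)      # 3814.697265625 MiB, matching the script's output

# The failing allocation plus memory already held, per the error message:
print(4_000_000_000 + 9_000_001_024)  # ~13GB requested in total, >3x the array itself
```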
Interestingly, I then restricted `randint` to `cp.random.randint(0, 2, size=N)` to see CuPy at least achieve something. This gave:
```
Scenario 1: Summing ranges of 1000000000 random integers between 0 and 1.
3814.697265625
CPU computation done in 577 ms.
3814.697265625
GPU computation done in 398 ms.
```
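A plausible reason the 0–1 case succeeds where 0–9 fails: a mask-based rejection sampler draws candidates up to the next power of two and redraws the overshoot, so a range of 2 accepts every draw, while a range of 10 rejects draws and must allocate extra buffers to retry. This toy sketch illustrates the acceptance rates (an illustration of the general technique, not CuPy's actual implementation; `masked_acceptance` is a hypothetical helper):

```python
def masked_acceptance(upper):
    """Fraction of candidate draws accepted by a mask-based sampler.

    Candidates are drawn uniformly from [0, mask], where mask is the
    smallest 2**k - 1 covering upper - 1; draws >= upper are rejected.
    """
    mask = 1
    while mask < upper - 1:
        mask = (mask << 1) | 1
    return upper / (mask + 1)

print(masked_acceptance(2))   # 1.0   -> every draw accepted, no retry buffers
print(masked_acceptance(10))  # 0.625 -> 37.5% of each multi-GB batch is redrawn
```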
Something’s fishy. Any clue?
Configuration:

```
OS : Windows-10-10.0.17134-SP0
CuPy Version : 8.5.0
NumPy Version : 1.20.1
SciPy Version : None
Cython Build Version : 0.29.22
CUDA Root : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2
CUDA Build Version : 11020
CUDA Driver Version : 11030
CUDA Runtime Version : 11020
cuBLAS Version : 11401
cuFFT Version : 10400
cuRAND Version : 10203
cuSOLVER Version : (11, 1, 0)
cuSPARSE Version : 11400
NVRTC Version : (11, 2)
Thrust Version : 101000
CUB Build Version : 101000
cuDNN Build Version : None
cuDNN Version : None
NCCL Build Version : None
NCCL Runtime Version : None
cuTENSOR Version : None
Device 0 Name : Quadro P620
Device 0 Compute Capability : 61
```
Issue Analytics
- Created: 3 years ago
- Reactions: 2
- Comments: 12 (7 by maintainers)
Top GitHub Comments
I think this is what is happening.
How am I monitoring this: I open the Windows Task Manager and watch the CPU (i5, 8 cores), RAM (16GB), and GPU graphs; the GPU graph is split in two, one for the GPU itself and one for the GDDR (4GB).
When I start this small test, the CPU is under 10% and RAM at 50% (8GB).
NumPy then starts: the CPU grows a bit but not too much, and around 4GB of RAM is consumed (75% total now). NumPy finishes and RAM goes back down to 8GB (50%); `del` hence worked as expected. Then CuPy starts:
I am eager to test it, but CuPy 9 is still in beta, right?
I think I know what happened here: you have one of those mobile GPUs where, when they run out of dedicated memory, Windows starts giving them a portion of main memory to use. That's why you see host memory growing up to a certain point. In CuPy, `randint` runs a loop that performs multiple memory allocations (a sample-and-reject loop), so for a 4GB array, just a few of these additional allocations can exhaust all the available memory. This is a known CuPy issue that we are aware of, and you can work around it by using the new random API Generator, as I suggested above.
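The new random API the maintainer refers to is `cupy.random.default_rng` (introduced in CuPy v9), which mirrors NumPy's Generator interface. A minimal sketch using the NumPy equivalent so it runs anywhere (on a CUDA machine, swap `np` for `cp`; the CuPy side is an assumption based on the v9 API):

```python
import numpy as np  # on a GPU machine: import cupy as cp and use cp.random.default_rng

rng = np.random.default_rng(seed=42)

# integers() draws directly into an array of the requested dtype; per the
# maintainer, the new Generator path avoids the legacy randint
# sample-and-reject buffers that blow up memory at this scale.
x = rng.integers(0, 10, size=1_000_000, dtype=np.int32)

print(x.min(), x.max())  # values lie in [0, 9]
print(x.nbytes)          # 4,000,000 bytes: one int32 per element
```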