CUDA: Local memory kernel launch causes excessive memory allocation
First reported on Gitter and in https://numba.discourse.group/t/bug-numba-using-a-lot-of-gpu-memory/255, the following reproducer:
import numba as nb
from numba import cuda


@cuda.jit("void(int32[:])", device=True)
def GPU_device_function(arr):
    return


@cuda.jit("void()")
def GPU_entry_point():
    # When this if is removed, it works normally
    if cuda.grid(1):
        return
    # Should use only 256 KB of memory.
    arr = cuda.local.array(shape=65536, dtype=nb.int32)
    # When this assignment is removed, it works normally
    arr[0] = 0
    # When this call is removed, it works normally
    GPU_device_function(arr)


######################

if __name__ == '__main__':
    print(cuda.select_device(0))
    ctx = cuda.current_context()

    print("\nInitial memory info:")
    print(ctx.get_memory_info())

    GPU_entry_point[1, 1]()

    print("\nAfter kernel launch:")
    print(ctx.get_memory_info())

    del GPU_entry_point
    ctx.deallocations.clear()

    print("\nAfter deleting function and clearing deallocations:")
    print(ctx.get_memory_info())

    ctx.reset()

    print("\nAfter resetting context:")
    print(ctx.get_memory_info())

    cuda.driver.driver.reset()
    ctx = cuda.current_context()

    print("\nAfter resetting driver:")
    print(ctx.get_memory_info())
produces the following output:
$ python repro.py
<CUDA device 0 'b'Quadro RTX 8000''>
Initial memory info:
MemoryInfo(free=50777096192, total=50962169856)
After kernel launch:
MemoryInfo(free=31525240832, total=50962169856)
After deleting function and clearing deallocations:
MemoryInfo(free=31525240832, total=50962169856)
After resetting context:
MemoryInfo(free=31527337984, total=50962169856)
After resetting driver:
MemoryInfo(free=50779193344, total=50962169856)
i.e. launching the kernel consumes a large amount of memory (>18 GB in this case). Deleting the function and resetting the context have no effect on the memory usage; it is only reclaimed when the driver is reset. The memory usage also correlates with the output from nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.28 Driver Version: 455.28 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 8000 Off | 00000000:15:00.0 Off | Off |
| 33% 30C P2 64W / 260W | 18534MiB / 48601MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro P2200 Off | 00000000:2D:00.0 On | N/A |
| 45% 33C P8 5W / 75W | 370MiB / 5049MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1091 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2158 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 11452 C python 18521MiB |
| 1 N/A N/A 1091 G /usr/lib/xorg/Xorg 71MiB |
| 1 N/A N/A 1695 G /usr/bin/gnome-shell 54MiB |
| 1 N/A N/A 2158 G /usr/lib/xorg/Xorg 188MiB |
| 1 N/A N/A 2475 G /usr/lib/firefox/firefox 1MiB |
| 1 N/A N/A 2480 G /usr/lib/firefox/firefox 47MiB |
+-----------------------------------------------------------------------------+
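For reference (not part of the original report), the same device-wide figure can be sampled programmatically around the kernel launch. The sketch below assumes nvidia-smi is on PATH and that the RTX 8000 is GPU index 0.
import subprocess

def gpu_memory_used_mib(device_index=0):
    # Device-wide used memory in MiB, as reported by nvidia-smi.
    out = subprocess.check_output([
        "nvidia-smi", "-i", str(device_index),
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip())

# Usage around the kernel launch from the repro above:
#   print(gpu_memory_used_mib())   # before: a few hundred MiB
#   GPU_entry_point[1, 1]()
#   print(gpu_memory_used_mib())   # after: jumps by roughly 18 GiB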
The debug log confirms that the module is unloaded when the context is reset:
$ NUMBA_CUDA_LOG_LEVEL=DEBUG python repro.py
== CUDA [262] INFO -- init
== CUDA [262] DEBUG -- call driver api: cuInit(0)
== CUDA [301] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f0c7a4c90)>)
== CUDA [301] DEBUG -- call driver api: cuDeviceGetCount(<cparam 'P' (0x7f0f0c7a4c90)>)
== CUDA [301] DEBUG -- call driver api: cuDeviceGet(<cparam 'P' (0x7f0f0c7a4c90)>, 0)
== CUDA [301] DEBUG -- call driver api: cuDeviceComputeCapability(<cparam 'P' (0x7f0f0c7a4390)>, <cparam 'P' (0x7f0f0c7a4a10)>, 0)
== CUDA [301] DEBUG -- call driver api: cuDeviceGetName(<numba.cuda.cudadrv.driver.c_char_Array_128 object at 0x7f0f0c7a4940>, 128, 0)
== CUDA [301] DEBUG -- call driver api: cuDeviceGet(<cparam 'P' (0x7f0f0c7a4990)>, 1)
== CUDA [301] DEBUG -- call driver api: cuDeviceComputeCapability(<cparam 'P' (0x7f0f0c7a4c10)>, <cparam 'P' (0x7f0f0c7a4a10)>, 1)
== CUDA [301] DEBUG -- call driver api: cuDeviceGetName(<numba.cuda.cudadrv.driver.c_char_Array_128 object at 0x7f0f0c7a4340>, 128, 1)
== CUDA [301] DEBUG -- call driver api: cuDevicePrimaryCtxRetain(<cparam 'P' (0x7f0f0c7a4390)>, 0)
== CUDA [403] DEBUG -- call driver api: cuCtxPushCurrent_v2(c_void_p(94725512477600))
== CUDA [403] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f0c7a4910)>, <cparam 'P' (0x7f0f0c7a4a90)>)
== CUDA [433] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8e10)>)
== CUDA [433] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [433] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8e10)>)
== CUDA [433] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [436] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [477] DEBUG -- call driver api: cuLinkCreate_v2(6, <numba.cuda.cudadrv.driver.c_int_Array_6 object at 0x7f0f049f4ac0>, <numba.cuda.cudadrv.driver.c_void_p_Array_5 object at 0x7f0f049f44c0>, <cparam 'P' (0x7f0f049f4c10)>)
== CUDA [477] DEBUG -- call driver api: cuLinkAddData_v2(c_void_p(94725521483008), 1, c_char_p(94725522258640), 1471, c_char_p(139702479087040), 0, None, None)
== CUDA [477] DEBUG -- call driver api: cuLinkComplete(c_void_p(94725521483008), <cparam 'P' (0x7f0f0c7a4910)>, <cparam 'P' (0x7f0f0c7a4c90)>)
== CUDA [478] DEBUG -- call driver api: cuModuleLoadDataEx(<cparam 'P' (0x7f0f049f4210)>, c_void_p(94725521446744), 5, <numba.cuda.cudadrv.driver.c_int_Array_5 object at 0x7f0f049f44c0>, <numba.cuda.cudadrv.driver.c_void_p_Array_5 object at 0x7f0f049f4140>)
== CUDA [478] DEBUG -- call driver api: cuModuleGetFunction(<cparam 'P' (0x7f0f04ad8c10)>, c_void_p(94725521453504), b'_ZN6cudapy8__main__19GPU_entry_point$242E')
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 4, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 2, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 3, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 1, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 0, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuLinkDestroy(c_void_p(94725521483008))
== CUDA [478] DEBUG -- call driver api: cuCtxPushCurrent_v2(c_void_p(94725512477600))
<CUDA device 0 'b'Quadro RTX 8000''>
== CUDA [478] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [478] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8e10)>)
Initial memory info:
== CUDA [479] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8f90)>, <cparam 'P' (0x7f0f04ad8e10)>)
MemoryInfo(free=50777096192, total=50962169856)
== CUDA [479] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8e10)>)
== CUDA [479] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [479] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8e10)>)
== CUDA [479] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [479] DEBUG -- call driver api: cuLaunchKernel(c_void_p(94725526521360), 1, 1, 1, 1, 1, 1, 0, None, <numba.cuda.cudadrv.driver.c_void_p_Array_0 object at 0x7f0f04ad8f40>, None)
After kernel launch:
== CUDA [498] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8f90)>, <cparam 'P' (0x7f0f0c7a4790)>)
MemoryInfo(free=31525240832, total=50962169856)
After deleting function and clearing deallocations:
== CUDA [498] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8c10)>, <cparam 'P' (0x7f0f04ad8d10)>)
MemoryInfo(free=31525240832, total=50962169856)
== CUDA [498] INFO -- reset context of device 0
== CUDA [498] INFO -- add pending dealloc: module_unload ? bytes
== CUDA [498] INFO -- dealloc: module_unload ? bytes
== CUDA [498] DEBUG -- call driver api: cuModuleUnload(c_void_p(94725521453504))
After resetting context:
== CUDA [501] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8d10)>, <cparam 'P' (0x7f0f04ad8c10)>)
MemoryInfo(free=31527337984, total=50962169856)
== CUDA [501] INFO -- reset context of device 0
== CUDA [502] DEBUG -- call driver api: cuDevicePrimaryCtxRelease_v2(0)
== CUDA [592] DEBUG -- call driver api: cuDevicePrimaryCtxReset_v2(0)
== CUDA [592] DEBUG -- call driver api: cuDevicePrimaryCtxReset_v2(1)
== CUDA [592] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8c10)>)
== CUDA [592] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8d10)>)
== CUDA [592] DEBUG -- call driver api: cuDevicePrimaryCtxRetain(<cparam 'P' (0x7f0f04ad8d10)>, 0)
== CUDA [715] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8c90)>, <cparam 'P' (0x7f0f04ad8f90)>)
After resetting driver:
== CUDA [715] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8c10)>, <cparam 'P' (0x7f0f04ad8f90)>)
MemoryInfo(free=50779193344, total=50962169856)
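As the log shows, cuModuleUnload alone does not return the memory; free memory only recovers after the driver reset, which issues cuDevicePrimaryCtxRelease/cuDevicePrimaryCtxReset. A heavy-handed workaround sketch based purely on that observation (it tears down all of the process's CUDA state):
from numba import cuda

def reclaim_gpu_memory():
    # Workaround sketch based only on the repro above: resetting the Numba
    # driver object triggers cuDevicePrimaryCtxRelease/Reset, the only step
    # in the log after which free memory recovered. Note this discards all
    # CUDA state (contexts, loaded modules) for the process.
    cuda.driver.driver.reset()
    return cuda.current_context().get_memory_info()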
The PTX looks odd: it appears that all threads will write 0 to every element of the local array when executed:
.version 7.0
.target sm_75
.address_size 64
// .globl _ZN6cudapy8__main__19GPU_entry_point$242E
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__errcode__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__tidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__ctaidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__tidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__ctaidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__tidz__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__ctaidz__;
.common .global .align 8 .u64 _ZN08NumbaEnv8__main__19GPU_entry_point$242E;
.common .global .align 8 .u64 _ZN08NumbaEnv8__main__23GPU_device_function$241E5ArrayIiLi1E1A7mutable7alignedE;
.visible .entry _ZN6cudapy8__main__19GPU_entry_point$242E(
)
{
    .local .align 16 .b8 __local_depot0[262144];
    .reg .b64 %SP;
    .reg .b64 %SPL;
    .reg .pred %p<2>;
    .reg .b32 %r<6>;
    .reg .b64 %rd<6>;

    mov.u64 %SPL, __local_depot0;
    add.u64 %rd5, %SPL, 0;
    mov.u32 %r3, 0;
    mov.u32 %r5, %r3;
BB0_1:
    st.local.u32 [%rd5], %r3;
    add.s64 %rd5, %rd5, 4;
    add.s32 %r5, %r5, 1;
    setp.lt.u32 %p1, %r5, 65536;
    @%p1 bra BB0_1;
    ret;
}
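The size of the allocation is consistent with the driver backing the 256 KiB per-thread local array for every thread that could be resident on the device at once, not just the single thread launched. A rough back-of-the-envelope sketch; the SM count and resident-thread limit below are assumptions for a Quadro RTX 8000 (sm_75), not values taken from the report:
# Rough estimate only; SM count and max resident threads are assumptions.
bytes_per_thread = 65536 * 4              # 256 KiB local array per thread
sm_count = 72                             # assumed for Quadro RTX 8000 (TU102)
max_threads_per_sm = 1024                 # assumed limit for sm_75

estimated = bytes_per_thread * sm_count * max_threads_per_sm
print(estimated / 2**30)                  # ~18.0 GiB

observed = 50777096192 - 31525240832      # free-memory drop from the output above
print(observed / 2**30)                   # ~17.9 GiB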
Seems like memory usage scales linearly with the CUDA core count.
In this case, the spike in virtual memory is a side effect of initializing CUDA in the process. The driver sets up address mappings across all GPUs in the system to enable unified memory addressing, which increases the process's virtual memory usage, but those mappings consume no physical or swap pages. Unfortunately, this makes virtual memory a useless metric for such processes, and Linux doesn't (AFAIK) have a "physical + swap" metric.
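As an aside, a minimal Linux-only sketch (not from the report) that separates the inflated virtual size from pages actually occupying RAM or swap, by reading /proc/self/status:
def memory_footprint_kib():
    # Reads VmSize (virtual), VmRSS (resident) and VmSwap from
    # /proc/self/status; the kernel reports all three in kB.
    fields = {}
    with open("/proc/self/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in ("VmSize", "VmRSS", "VmSwap"):
                fields[key] = int(value.split()[0])
    return fields

info = memory_footprint_kib()
print("virtual:", info.get("VmSize", 0), "kB")   # inflated by CUDA init
print("physical + swap:", info.get("VmRSS", 0) + info.get("VmSwap", 0), "kB")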