CUDA: Local memory kernel launch causes excessive memory allocation
First reported on Gitter and in https://numba.discourse.group/t/bug-numba-using-a-lot-of-gpu-memory/255, the following reproducer:
import numba as nb
from numba import cuda


@cuda.jit("void(int32[:])", device=True)
def GPU_device_function(arr):
    return


@cuda.jit("void()")
def GPU_entry_point():
    # When this if is removed, it works normally
    if cuda.grid(1):
        return
    # Should use only 256 KB of memory.
    arr = cuda.local.array(shape=65536, dtype=nb.int32)
    # When this assignment is removed, it works normally
    arr[0] = 0
    # When this call is removed, it works normally
    GPU_device_function(arr)


######################

if __name__ == '__main__':
    print(cuda.select_device(0))
    ctx = cuda.current_context()

    print("\nInitial memory info:")
    print(ctx.get_memory_info())

    GPU_entry_point[1, 1]()

    print("\nAfter kernel launch:")
    print(ctx.get_memory_info())

    del GPU_entry_point
    ctx.deallocations.clear()

    print("\nAfter deleting function and clearing deallocations:")
    print(ctx.get_memory_info())

    ctx.reset()

    print("\nAfter resetting context:")
    print(ctx.get_memory_info())

    cuda.driver.driver.reset()
    ctx = cuda.current_context()

    print("\nAfter resetting driver:")
    print(ctx.get_memory_info())
produces the following output:
$ python repro.py
<CUDA device 0 'b'Quadro RTX 8000''>
Initial memory info:
MemoryInfo(free=50777096192, total=50962169856)
After kernel launch:
MemoryInfo(free=31525240832, total=50962169856)
After deleting function and clearing deallocations:
MemoryInfo(free=31525240832, total=50962169856)
After resetting context:
MemoryInfo(free=31527337984, total=50962169856)
After resetting driver:
MemoryInfo(free=50779193344, total=50962169856)
i.e. launching the kernel consumes a large amount of memory (>18 GB in this case). Deleting the function and resetting the context have no effect on the memory usage; it is only reclaimed when the driver is reset. The memory usage also correlates with the output from nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.28 Driver Version: 455.28 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 8000 Off | 00000000:15:00.0 Off | Off |
| 33% 30C P2 64W / 260W | 18534MiB / 48601MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro P2200 Off | 00000000:2D:00.0 On | N/A |
| 45% 33C P8 5W / 75W | 370MiB / 5049MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1091 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2158 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 11452 C python 18521MiB |
| 1 N/A N/A 1091 G /usr/lib/xorg/Xorg 71MiB |
| 1 N/A N/A 1695 G /usr/bin/gnome-shell 54MiB |
| 1 N/A N/A 2158 G /usr/lib/xorg/Xorg 188MiB |
| 1 N/A N/A 2475 G /usr/lib/firefox/firefox 1MiB |
| 1 N/A N/A 2480 G /usr/lib/firefox/firefox 47MiB |
+-----------------------------------------------------------------------------+
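For reference (not part of the original report), the same device-wide figure can be sampled programmatically around the kernel launch. The sketch below assumes nvidia-smi is on PATH and that the RTX 8000 is GPU index 0.
import subprocess

def gpu_memory_used_mib(device_index=0):
    # Device-wide used memory in MiB, as reported by nvidia-smi.
    out = subprocess.check_output([
        "nvidia-smi", "-i", str(device_index),
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip())

# Usage around the kernel launch from the repro above:
#   print(gpu_memory_used_mib())   # before: a few hundred MiB
#   GPU_entry_point[1, 1]()
#   print(gpu_memory_used_mib())   # after: jumps by roughly 18 GiB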
The debug log confirms that the module is unloaded when the context is reset:
$ NUMBA_CUDA_LOG_LEVEL=DEBUG python repro.py
== CUDA [262] INFO -- init
== CUDA [262] DEBUG -- call driver api: cuInit(0)
== CUDA [301] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f0c7a4c90)>)
== CUDA [301] DEBUG -- call driver api: cuDeviceGetCount(<cparam 'P' (0x7f0f0c7a4c90)>)
== CUDA [301] DEBUG -- call driver api: cuDeviceGet(<cparam 'P' (0x7f0f0c7a4c90)>, 0)
== CUDA [301] DEBUG -- call driver api: cuDeviceComputeCapability(<cparam 'P' (0x7f0f0c7a4390)>, <cparam 'P' (0x7f0f0c7a4a10)>, 0)
== CUDA [301] DEBUG -- call driver api: cuDeviceGetName(<numba.cuda.cudadrv.driver.c_char_Array_128 object at 0x7f0f0c7a4940>, 128, 0)
== CUDA [301] DEBUG -- call driver api: cuDeviceGet(<cparam 'P' (0x7f0f0c7a4990)>, 1)
== CUDA [301] DEBUG -- call driver api: cuDeviceComputeCapability(<cparam 'P' (0x7f0f0c7a4c10)>, <cparam 'P' (0x7f0f0c7a4a10)>, 1)
== CUDA [301] DEBUG -- call driver api: cuDeviceGetName(<numba.cuda.cudadrv.driver.c_char_Array_128 object at 0x7f0f0c7a4340>, 128, 1)
== CUDA [301] DEBUG -- call driver api: cuDevicePrimaryCtxRetain(<cparam 'P' (0x7f0f0c7a4390)>, 0)
== CUDA [403] DEBUG -- call driver api: cuCtxPushCurrent_v2(c_void_p(94725512477600))
== CUDA [403] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f0c7a4910)>, <cparam 'P' (0x7f0f0c7a4a90)>)
== CUDA [433] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8e10)>)
== CUDA [433] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [433] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8e10)>)
== CUDA [433] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [436] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [477] DEBUG -- call driver api: cuLinkCreate_v2(6, <numba.cuda.cudadrv.driver.c_int_Array_6 object at 0x7f0f049f4ac0>, <numba.cuda.cudadrv.driver.c_void_p_Array_5 object at 0x7f0f049f44c0>, <cparam 'P' (0x7f0f049f4c10)>)
== CUDA [477] DEBUG -- call driver api: cuLinkAddData_v2(c_void_p(94725521483008), 1, c_char_p(94725522258640), 1471, c_char_p(139702479087040), 0, None, None)
== CUDA [477] DEBUG -- call driver api: cuLinkComplete(c_void_p(94725521483008), <cparam 'P' (0x7f0f0c7a4910)>, <cparam 'P' (0x7f0f0c7a4c90)>)
== CUDA [478] DEBUG -- call driver api: cuModuleLoadDataEx(<cparam 'P' (0x7f0f049f4210)>, c_void_p(94725521446744), 5, <numba.cuda.cudadrv.driver.c_int_Array_5 object at 0x7f0f049f44c0>, <numba.cuda.cudadrv.driver.c_void_p_Array_5 object at 0x7f0f049f4140>)
== CUDA [478] DEBUG -- call driver api: cuModuleGetFunction(<cparam 'P' (0x7f0f04ad8c10)>, c_void_p(94725521453504), b'_ZN6cudapy8__main__19GPU_entry_point$242E')
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 4, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 2, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 3, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 1, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 0, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuLinkDestroy(c_void_p(94725521483008))
== CUDA [478] DEBUG -- call driver api: cuCtxPushCurrent_v2(c_void_p(94725512477600))
<CUDA device 0 'b'Quadro RTX 8000''>
== CUDA [478] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [478] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8e10)>)
Initial memory info:
== CUDA [479] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8f90)>, <cparam 'P' (0x7f0f04ad8e10)>)
MemoryInfo(free=50777096192, total=50962169856)
== CUDA [479] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8e10)>)
== CUDA [479] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [479] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8e10)>)
== CUDA [479] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [479] DEBUG -- call driver api: cuLaunchKernel(c_void_p(94725526521360), 1, 1, 1, 1, 1, 1, 0, None, <numba.cuda.cudadrv.driver.c_void_p_Array_0 object at 0x7f0f04ad8f40>, None)
After kernel launch:
== CUDA [498] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8f90)>, <cparam 'P' (0x7f0f0c7a4790)>)
MemoryInfo(free=31525240832, total=50962169856)
After deleting function and clearing deallocations:
== CUDA [498] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8c10)>, <cparam 'P' (0x7f0f04ad8d10)>)
MemoryInfo(free=31525240832, total=50962169856)
== CUDA [498] INFO -- reset context of device 0
== CUDA [498] INFO -- add pending dealloc: module_unload ? bytes
== CUDA [498] INFO -- dealloc: module_unload ? bytes
== CUDA [498] DEBUG -- call driver api: cuModuleUnload(c_void_p(94725521453504))
After resetting context:
== CUDA [501] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8d10)>, <cparam 'P' (0x7f0f04ad8c10)>)
MemoryInfo(free=31527337984, total=50962169856)
== CUDA [501] INFO -- reset context of device 0
== CUDA [502] DEBUG -- call driver api: cuDevicePrimaryCtxRelease_v2(0)
== CUDA [592] DEBUG -- call driver api: cuDevicePrimaryCtxReset_v2(0)
== CUDA [592] DEBUG -- call driver api: cuDevicePrimaryCtxReset_v2(1)
== CUDA [592] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8c10)>)
== CUDA [592] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8d10)>)
== CUDA [592] DEBUG -- call driver api: cuDevicePrimaryCtxRetain(<cparam 'P' (0x7f0f04ad8d10)>, 0)
== CUDA [715] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8c90)>, <cparam 'P' (0x7f0f04ad8f90)>)
After resetting driver:
== CUDA [715] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8c10)>, <cparam 'P' (0x7f0f04ad8f90)>)
MemoryInfo(free=50779193344, total=50962169856)
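As the log shows, cuModuleUnload alone does not return the memory; free memory only recovers after the driver reset, which issues cuDevicePrimaryCtxRelease/cuDevicePrimaryCtxReset. A heavy-handed workaround sketch based purely on that observation (it tears down all of the process's CUDA state):
from numba import cuda

def reclaim_gpu_memory():
    # Workaround sketch based only on the repro above: resetting the Numba
    # driver object triggers cuDevicePrimaryCtxRelease/Reset, the only step
    # in the log after which free memory recovered. Note this discards all
    # CUDA state (contexts, loaded modules) for the process.
    cuda.driver.driver.reset()
    return cuda.current_context().get_memory_info()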
The PTX looks odd: it appears that all threads will write 0 to every element of the local array when executed:
.version 7.0
.target sm_75
.address_size 64
// .globl _ZN6cudapy8__main__19GPU_entry_point$242E
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__errcode__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__tidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__ctaidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__tidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__ctaidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__tidz__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__ctaidz__;
.common .global .align 8 .u64 _ZN08NumbaEnv8__main__19GPU_entry_point$242E;
.common .global .align 8 .u64 _ZN08NumbaEnv8__main__23GPU_device_function$241E5ArrayIiLi1E1A7mutable7alignedE;
.visible .entry _ZN6cudapy8__main__19GPU_entry_point$242E(
)
{
    .local .align 16 .b8 __local_depot0[262144];
    .reg .b64 %SP;
    .reg .b64 %SPL;
    .reg .pred %p<2>;
    .reg .b32 %r<6>;
    .reg .b64 %rd<6>;

    mov.u64 %SPL, __local_depot0;
    add.u64 %rd5, %SPL, 0;
    mov.u32 %r3, 0;
    mov.u32 %r5, %r3;
BB0_1:
    st.local.u32 [%rd5], %r3;
    add.s64 %rd5, %rd5, 4;
    add.s32 %r5, %r5, 1;
    setp.lt.u32 %p1, %r5, 65536;
    @%p1 bra BB0_1;
    ret;
}
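The size of the allocation is consistent with the driver backing the 256 KiB per-thread local array for every thread that could be resident on the device at once, not just the single thread launched. A rough back-of-the-envelope sketch; the SM count and resident-thread limit below are assumptions for a Quadro RTX 8000 (sm_75), not values taken from the report:
# Rough estimate only; SM count and max resident threads are assumptions.
bytes_per_thread = 65536 * 4              # 256 KiB local array per thread
sm_count = 72                             # assumed for Quadro RTX 8000 (TU102)
max_threads_per_sm = 1024                 # assumed limit for sm_75

estimated = bytes_per_thread * sm_count * max_threads_per_sm
print(estimated / 2**30)                  # ~18.0 GiB

observed = 50777096192 - 31525240832      # free-memory drop from the output above
print(observed / 2**30)                   # ~17.9 GiB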
Seems like memory usage scales linearly with the CUDA core count.
In this case, the spike in virtual memory is a side effect of initializing CUDA in the process. The driver sets up address mappings across all GPUs in the system to enable unified memory addressing, which increases the process's virtual memory usage, but those mappings consume no physical or swap pages. Unfortunately, this makes virtual memory a useless metric for such processes, and Linux doesn't (AFAIK) have a "physical + swap" metric.
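As an aside, a minimal Linux-only sketch (not from the report) that separates the inflated virtual size from pages actually occupying RAM or swap, by reading /proc/self/status:
def memory_footprint_kib():
    # Reads VmSize (virtual), VmRSS (resident) and VmSwap from
    # /proc/self/status; the kernel reports all three in kB.
    fields = {}
    with open("/proc/self/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in ("VmSize", "VmRSS", "VmSwap"):
                fields[key] = int(value.split()[0])
    return fields

info = memory_footprint_kib()
print("virtual:", info.get("VmSize", 0), "kB")   # inflated by CUDA init
print("physical + swap:", info.get("VmRSS", 0) + info.get("VmSwap", 0), "kB")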