
CUDA: Local memory kernel launch causes excessive memory allocation


First reported on Gitter and at https://numba.discourse.group/t/bug-numba-using-a-lot-of-gpu-memory/255, the following reproducer:

import numba as nb
from numba import cuda


@cuda.jit("void(int32[:])", device=True)
def GPU_device_function(arr):
    return


@cuda.jit("void()")
def GPU_entry_point():
    # When this if is removed, it works normally
    if cuda.grid(1):
        return

    # Should use only 256 KB of memory.
    arr = cuda.local.array(shape=65536, dtype=nb.int32)

    # When this assignment is removed, it works normally
    arr[0] = 0

    # When this call is removed, it works normally
    GPU_device_function(arr)

######################


if __name__ == '__main__':
    print(cuda.select_device(0))

    ctx = cuda.current_context()
    print("\nInitial memory info:")
    print(ctx.get_memory_info())

    GPU_entry_point[1, 1]()
    print("\nAfter kernel launch:")
    print(ctx.get_memory_info())

    del GPU_entry_point
    ctx.deallocations.clear()
    print("\nAfter deleting function and clearing deallocations:")
    print(ctx.get_memory_info())

    ctx.reset()
    print("\nAfter resetting context:")
    print(ctx.get_memory_info())

    cuda.driver.driver.reset()
    ctx = cuda.current_context()
    print("\nAfter resetting driver:")
    print(ctx.get_memory_info())

produces the following output:

$ python repro.py 
<CUDA device 0 'b'Quadro RTX 8000''>

Initial memory info:
MemoryInfo(free=50777096192, total=50962169856)

After kernel launch:
MemoryInfo(free=31525240832, total=50962169856)

After deleting function and clearing deallocations:
MemoryInfo(free=31525240832, total=50962169856)

After resetting context:
MemoryInfo(free=31527337984, total=50962169856)

After resetting driver:
MemoryInfo(free=50779193344, total=50962169856)

That is, launching the kernel consumes a large amount of device memory (more than 18 GB in this case). Deleting the function and resetting the context have essentially no effect on the memory usage; the memory is only reclaimed when the driver is reset. The usage is also reflected in nvidia-smi (a quick arithmetic check on these figures follows the nvidia-smi output below):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.28       Driver Version: 455.28       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:15:00.0 Off |                  Off |
| 33%   30C    P2    64W / 260W |  18534MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P2200        Off  | 00000000:2D:00.0  On |                  N/A |
| 45%   33C    P8     5W /  75W |    370MiB /  5049MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1091      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2158      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A     11452      C   python                          18521MiB |
|    1   N/A  N/A      1091      G   /usr/lib/xorg/Xorg                 71MiB |
|    1   N/A  N/A      1695      G   /usr/bin/gnome-shell               54MiB |
|    1   N/A  N/A      2158      G   /usr/lib/xorg/Xorg                188MiB |
|    1   N/A  N/A      2475      G   /usr/lib/firefox/firefox            1MiB |
|    1   N/A  N/A      2480      G   /usr/lib/firefox/firefox           47MiB |
+-----------------------------------------------------------------------------+
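For reference, the size of the drop can be computed directly from the MemoryInfo values printed above and compared against the 256 KiB a single thread's local array should occupy:

# Arithmetic on the reproducer's own numbers: the drop in free device memory
# across the kernel launch versus the size of a single thread's local array.
free_before = 50777096192          # "Initial memory info"
free_after = 31525240832           # "After kernel launch"
drop = free_before - free_after
per_thread = 65536 * 4             # int32 local array: 262,144 bytes (256 KiB)
print(drop / 2**30, "GiB")         # ~17.9 GiB
print(drop // per_thread)          # ~73,000x one thread's local array

So the allocation is several orders of magnitude larger than what the single launched thread needs, which points at local memory being provisioned for far more threads than were actually launched.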

The debug log confirms that the module is unloaded when the context is reset:

$ NUMBA_CUDA_LOG_LEVEL=DEBUG python repro.py 
== CUDA [262]  INFO -- init
== CUDA [262] DEBUG -- call driver api: cuInit(0)
== CUDA [301] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f0c7a4c90)>)
== CUDA [301] DEBUG -- call driver api: cuDeviceGetCount(<cparam 'P' (0x7f0f0c7a4c90)>)
== CUDA [301] DEBUG -- call driver api: cuDeviceGet(<cparam 'P' (0x7f0f0c7a4c90)>, 0)
== CUDA [301] DEBUG -- call driver api: cuDeviceComputeCapability(<cparam 'P' (0x7f0f0c7a4390)>, <cparam 'P' (0x7f0f0c7a4a10)>, 0)
== CUDA [301] DEBUG -- call driver api: cuDeviceGetName(<numba.cuda.cudadrv.driver.c_char_Array_128 object at 0x7f0f0c7a4940>, 128, 0)
== CUDA [301] DEBUG -- call driver api: cuDeviceGet(<cparam 'P' (0x7f0f0c7a4990)>, 1)
== CUDA [301] DEBUG -- call driver api: cuDeviceComputeCapability(<cparam 'P' (0x7f0f0c7a4c10)>, <cparam 'P' (0x7f0f0c7a4a10)>, 1)
== CUDA [301] DEBUG -- call driver api: cuDeviceGetName(<numba.cuda.cudadrv.driver.c_char_Array_128 object at 0x7f0f0c7a4340>, 128, 1)
== CUDA [301] DEBUG -- call driver api: cuDevicePrimaryCtxRetain(<cparam 'P' (0x7f0f0c7a4390)>, 0)
== CUDA [403] DEBUG -- call driver api: cuCtxPushCurrent_v2(c_void_p(94725512477600))
== CUDA [403] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f0c7a4910)>, <cparam 'P' (0x7f0f0c7a4a90)>)
== CUDA [433] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8e10)>)
== CUDA [433] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [433] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8e10)>)
== CUDA [433] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [436] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [477] DEBUG -- call driver api: cuLinkCreate_v2(6, <numba.cuda.cudadrv.driver.c_int_Array_6 object at 0x7f0f049f4ac0>, <numba.cuda.cudadrv.driver.c_void_p_Array_5 object at 0x7f0f049f44c0>, <cparam 'P' (0x7f0f049f4c10)>)
== CUDA [477] DEBUG -- call driver api: cuLinkAddData_v2(c_void_p(94725521483008), 1, c_char_p(94725522258640), 1471, c_char_p(139702479087040), 0, None, None)
== CUDA [477] DEBUG -- call driver api: cuLinkComplete(c_void_p(94725521483008), <cparam 'P' (0x7f0f0c7a4910)>, <cparam 'P' (0x7f0f0c7a4c90)>)
== CUDA [478] DEBUG -- call driver api: cuModuleLoadDataEx(<cparam 'P' (0x7f0f049f4210)>, c_void_p(94725521446744), 5, <numba.cuda.cudadrv.driver.c_int_Array_5 object at 0x7f0f049f44c0>, <numba.cuda.cudadrv.driver.c_void_p_Array_5 object at 0x7f0f049f4140>)
== CUDA [478] DEBUG -- call driver api: cuModuleGetFunction(<cparam 'P' (0x7f0f04ad8c10)>, c_void_p(94725521453504), b'_ZN6cudapy8__main__19GPU_entry_point$242E')
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 4, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 2, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 3, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 1, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuFuncGetAttribute(<cparam 'P' (0x7f0f0c7a4c90)>, 0, c_void_p(94725526521360))
== CUDA [478] DEBUG -- call driver api: cuLinkDestroy(c_void_p(94725521483008))
== CUDA [478] DEBUG -- call driver api: cuCtxPushCurrent_v2(c_void_p(94725512477600))
<CUDA device 0 'b'Quadro RTX 8000''>
== CUDA [478] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [478] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8e10)>)

Initial memory info:
== CUDA [479] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8f90)>, <cparam 'P' (0x7f0f04ad8e10)>)
MemoryInfo(free=50777096192, total=50962169856)
== CUDA [479] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8e10)>)
== CUDA [479] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [479] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8e10)>)
== CUDA [479] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8f90)>)
== CUDA [479] DEBUG -- call driver api: cuLaunchKernel(c_void_p(94725526521360), 1, 1, 1, 1, 1, 1, 0, None, <numba.cuda.cudadrv.driver.c_void_p_Array_0 object at 0x7f0f04ad8f40>, None)

After kernel launch:
== CUDA [498] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8f90)>, <cparam 'P' (0x7f0f0c7a4790)>)
MemoryInfo(free=31525240832, total=50962169856)

After deleting function and clearing deallocations:
== CUDA [498] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8c10)>, <cparam 'P' (0x7f0f04ad8d10)>)
MemoryInfo(free=31525240832, total=50962169856)
== CUDA [498]  INFO -- reset context of device 0
== CUDA [498]  INFO -- add pending dealloc: module_unload ? bytes
== CUDA [498]  INFO -- dealloc: module_unload ? bytes
== CUDA [498] DEBUG -- call driver api: cuModuleUnload(c_void_p(94725521453504))

After resetting context:
== CUDA [501] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8d10)>, <cparam 'P' (0x7f0f04ad8c10)>)
MemoryInfo(free=31527337984, total=50962169856)
== CUDA [501]  INFO -- reset context of device 0
== CUDA [502] DEBUG -- call driver api: cuDevicePrimaryCtxRelease_v2(0)
== CUDA [592] DEBUG -- call driver api: cuDevicePrimaryCtxReset_v2(0)
== CUDA [592] DEBUG -- call driver api: cuDevicePrimaryCtxReset_v2(1)
== CUDA [592] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f0f04ad8c10)>)
== CUDA [592] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f0f04ad8d10)>)
== CUDA [592] DEBUG -- call driver api: cuDevicePrimaryCtxRetain(<cparam 'P' (0x7f0f04ad8d10)>, 0)
== CUDA [715] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8c90)>, <cparam 'P' (0x7f0f04ad8f90)>)

After resetting driver:
== CUDA [715] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f0f04ad8c10)>, <cparam 'P' (0x7f0f04ad8f90)>)
MemoryInfo(free=50779193344, total=50962169856)

The PTX looks odd: it appears that each thread will write 0 to every element of its local array when the kernel executes:

.version 7.0
.target sm_75
.address_size 64

	// .globl	_ZN6cudapy8__main__19GPU_entry_point$242E
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__errcode__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__tidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__ctaidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__tidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__ctaidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__tidz__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19GPU_entry_point$242E__ctaidz__;
.common .global .align 8 .u64 _ZN08NumbaEnv8__main__19GPU_entry_point$242E;
.common .global .align 8 .u64 _ZN08NumbaEnv8__main__23GPU_device_function$241E5ArrayIiLi1E1A7mutable7alignedE;

.visible .entry _ZN6cudapy8__main__19GPU_entry_point$242E(

)
{
	.local .align 16 .b8 	__local_depot0[262144];
	.reg .b64 	%SP;
	.reg .b64 	%SPL;
	.reg .pred 	%p<2>;
	.reg .b32 	%r<6>;
	.reg .b64 	%rd<6>;


	mov.u64 	%SPL, __local_depot0;
	add.u64 	%rd5, %SPL, 0;
	mov.u32 	%r3, 0;
	mov.u32 	%r5, %r3;

BB0_1:
	st.local.u32 	[%rd5], %r3;
	add.s64 	%rd5, %rd5, 4;
	add.s32 	%r5, %r5, 1;
	setp.lt.u32	%p1, %r5, 65536;
	@%p1 bra 	BB0_1;

	ret;
}
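For reference, the PTX above can also be dumped programmatically from the kernel in the reproducer; a minimal sketch, assuming the inspect_asm() method on Numba CUDA kernels (its return type differs between Numba versions, hence the defensive handling):

# Print the PTX generated for GPU_entry_point from the reproducer above.
asm = GPU_entry_point.inspect_asm()
# Some Numba versions return a plain string, others a dict keyed by signature.
ptx = next(iter(asm.values())) if isinstance(asm, dict) else asm
print(ptx)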

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

2 reactions
gamecraftCZ commented, Oct 12, 2020

Seems like memory usage scales linearly with the CUDA core count:

GPU              Cores    Memory usage
Quadro RTX 8000  4,608    18.5 GB
GTX 1050           640     2.7 GB
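This scaling is roughly what one would expect if the driver provisioned the 256 KiB of local memory for every thread that could be resident on the device at once (SM count × maximum resident threads per SM) rather than for the threads actually launched. A sketch of that estimate, using published device limits that are assumptions here rather than figures from the thread (72 SMs and 1024 resident threads per SM for the Turing-based RTX 8000; 5 SMs and 2048 resident threads per SM for the Pascal-based GTX 1050):

# Rough sizing estimate under the "per potentially-resident thread" assumption.
local_kib = 256                            # 65536 x int32 per thread
for name, sms, threads_per_sm in [("Quadro RTX 8000", 72, 1024),   # Turing
                                  ("GTX 1050",         5, 2048)]:  # Pascal
    gib = local_kib * sms * threads_per_sm / 2**20
    print(f"{name}: ~{gib:.1f} GiB")       # ~18.0 GiB and ~2.5 GiB

Both figures are in the same ballpark as the usage reported above, suggesting the growth tracks SM count and per-SM thread capacity, which in turn correlate with core count.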
0 reactions
seibert commented, Mar 22, 2021

In this case, the spike in virtual memory is a side effect of initializing CUDA in the process. The driver sets up address mappings across all GPUs in the system to enable unified memory addressing, which increases the process's virtual memory usage, but these mappings consume no physical or swap pages. Unfortunately, this makes virtual memory a useless metric for such processes, and Linux doesn't have (AFAIK) a "physical + swap" metric.
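To make this point concrete, one can watch the process's virtual size against its resident set size around CUDA initialization. A minimal Linux-only sketch that parses /proc/self/status (exact deltas depend on the driver and GPU count; the pattern to expect is a large VmSize jump with a comparatively small VmRSS jump):

from numba import cuda

# Read the process's virtual size (VmSize) and resident set size (VmRSS), in kB,
# from /proc/self/status (Linux only).
def vm_stats():
    stats = {}
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(("VmSize", "VmRSS")):
                key, value = line.split(":", 1)
                stats[key] = int(value.split()[0])   # value is "<n> kB"
    return stats

before = vm_stats()
cuda.current_context()     # CUDA/driver initialization happens here
after = vm_stats()
print("VmSize delta:", after["VmSize"] - before["VmSize"], "kB")   # large (address mappings)
print("VmRSS delta:", after["VmRSS"] - before["VmRSS"], "kB")      # comparatively small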


