Multithreaded cuFFT memory leak
Description
The cuFFT plan cache does not appear to deallocate GPU memory during thread cleanup (GC). If I launch cp.fft.fft
on a thread and then exit/join the thread, residual memory remains allocated on the GPU.
To Reproduce
from concurrent.futures import ThreadPoolExecutor

import cupy as cp


def task():
    # Allocate an array on the GPU and run an FFT, which populates the
    # worker thread's cuFFT plan cache.
    data = cp.ones(2**20)
    return cp.fft.fft(data)


while True:
    with ThreadPoolExecutor(max_workers=3) as executor:
        tasks = [executor.submit(task) for _ in range(10)]
    # The worker threads are joined when the executor exits; try to release
    # everything from the main thread.
    cp.get_default_memory_pool().free_all_blocks()
    cp.get_default_pinned_memory_pool().free_all_blocks()
    cp.fft.config.get_plan_cache().clear()
    input("Check nvidia-smi memory usage... Press any key to run again")
Installation
Wheel (pip install cupy-***)
Environment
OS : Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.29
Python Version : 3.8.10
CuPy Version : 10.0.0
CuPy Platform : NVIDIA CUDA
NumPy Version : 1.21.5
SciPy Version : 1.7.3
Cython Build Version : 0.29.24
Cython Runtime Version : None
CUDA Root : /usr/local/cuda
nvcc PATH : None
CUDA Build Version : 11040
CUDA Driver Version : 11040
CUDA Runtime Version : 11040
cuBLAS Version : (available)
cuFFT Version : 10502
cuRAND Version : 10205
cuSOLVER Version : (11, 2, 0)
cuSPARSE Version : (available)
NVRTC Version : (11, 4)
Thrust Version : 101201
CUB Build Version : 101201
Jitify Build Version : 60e9e72
cuDNN Build Version : (not loaded; try `import cupy.cuda.cudnn` first)
cuDNN Version : (not loaded; try `import cupy.cuda.cudnn` first)
NCCL Build Version : 21104
NCCL Runtime Version : 21104
cuTENSOR Version : None
cuSPARSELt Build Version : None
Device 0 Name : Quadro RTX 5000
Device 0 Compute Capability : 75
Device 0 PCI Bus ID : 0000:01:00.0
Additional Information
The runtime is inside the Docker image nvidia/cuda:11.4.2-runtime-ubuntu20.04.
Issue Analytics
- Created 2 years ago
- Comments: 15 (15 by maintainers)
Top GitHub Comments
OK. I tested it.
cupy=10.4.0 runs out of memory with the script above, but your patched branch does not run out of memory. I don't have a real-world test because I already patched my projects to work around this issue. Thanks @leofang! 😄

I tried to test it, but I am having trouble compiling! I keep getting compiler errors (related to syntax, not linking or missing libraries). Maybe I should just clone the conda-forge cupy-feedstock and build in their docker container! 😆