
Multithreaded cufft memory leak

See original GitHub issue

Description

The cuFFT plan cache does not appear to deallocate GPU memory during thread cleanup (GC). If I run cp.fft.fft on a thread and then exit/join the thread, residual memory remains allocated on the GPU.

To Reproduce

from concurrent.futures import ThreadPoolExecutor
import cupy as cp

def task():
    data = cp.ones(2**20)
    return cp.fft.fft(data)

while True:
    with ThreadPoolExecutor(max_workers=3) as executor:
        tasks = [executor.submit(task) for _ in range(10)]
        
    cp.get_default_memory_pool().free_all_blocks()
    cp.get_default_pinned_memory_pool().free_all_blocks()
    cp.fft.config.get_plan_cache().clear()
    
    input("Check nvidia-smi memory usage... Press Enter to run again")
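
Note that the three cleanup calls above all run on the main thread, and CuPy documents the FFT plan cache as per-thread, so get_plan_cache().clear() never touches the caches belonging to the (now retired) worker threads. The visibility issue can be sketched in plain Python with threading.local standing in for the per-thread cache; the names here are illustrative, not CuPy's internals.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()  # stand-in: one cache per thread

def get_plan_cache():
    # Each thread sees only its own cache, mirroring the per-thread design.
    if not hasattr(_local, "cache"):
        _local.cache = {}
    return _local.cache

def task(n):
    get_plan_cache()[n] = bytearray(1024)  # pretend this is a cuFFT plan
    return threading.get_ident()

with ThreadPoolExecutor(max_workers=3) as ex:
    worker_ids = {f.result() for f in [ex.submit(task, i) for i in range(10)]}

# This clears only the MAIN thread's (empty) cache; the workers' caches
# were never reachable from here.
get_plan_cache().clear()
print(threading.get_ident() in worker_ids)  # main thread is not a worker
```

Given that per-thread design, one plausible workaround is to call cp.fft.config.get_plan_cache().clear() inside task() itself, so each worker clears its own cache before the thread retires.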

Installation

Wheel (pip install cupy-***)

Environment

OS                           : Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.29
Python Version               : 3.8.10
CuPy Version                 : 10.0.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.21.5
SciPy Version                : 1.7.3
Cython Build Version         : 0.29.24
Cython Runtime Version       : None
CUDA Root                    : /usr/local/cuda
nvcc PATH                    : None
CUDA Build Version           : 11040
CUDA Driver Version          : 11040
CUDA Runtime Version         : 11040
cuBLAS Version               : (available)
cuFFT Version                : 10502
cuRAND Version               : 10205
cuSOLVER Version             : (11, 2, 0)
cuSPARSE Version             : (available)
NVRTC Version                : (11, 4)
Thrust Version               : 101201
CUB Build Version            : 101201
Jitify Build Version         : 60e9e72
cuDNN Build Version          : (not loaded; try `import cupy.cuda.cudnn` first)
cuDNN Version                : (not loaded; try `import cupy.cuda.cudnn` first)
NCCL Build Version           : 21104
NCCL Runtime Version         : 21104
cuTENSOR Version             : None
cuSPARSELt Build Version     : None
Device 0 Name                : Quadro RTX 5000
Device 0 Compute Capability  : 75
Device 0 PCI Bus ID          : 0000:01:00.0

Additional Information

Runtime is within docker image nvidia/cuda:11.4.2-runtime-ubuntu20.04

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

1 reaction
carterbox commented, May 9, 2022

OK. I tested it. cupy=10.4.0 runs out of memory with the script above, but your patched branch does not run out of memory. I don’t have a real world test because I already patched my projects to work around this issue. Thanks @leofang! 😄

1 reaction
carterbox commented, May 9, 2022

I tried to test it, but I am having trouble compiling! I keep getting compiler errors (syntax-related, not linking or missing-library errors). Maybe I should just clone the conda-forge cupy-feedstock and build in their docker container! 😆

