Multithreaded cuFFT memory leak
Description
The cuFFT plan cache does not appear to deallocate GPU memory during thread cleanup (GC). If I launch cp.fft.fft
on a thread and then exit/join the thread, residual memory remains allocated on the GPU.
To Reproduce
from concurrent.futures import ThreadPoolExecutor

import cupy as cp


def task():
    # Allocate an array on the GPU and run an FFT, which populates the
    # worker thread's cuFFT plan cache.
    data = cp.ones(2**20)
    return cp.fft.fft(data)


while True:
    with ThreadPoolExecutor(max_workers=3) as executor:
        tasks = [executor.submit(task) for _ in range(10)]
    # The worker threads are joined when the executor exits; try to release
    # everything from the main thread.
    cp.get_default_memory_pool().free_all_blocks()
    cp.get_default_pinned_memory_pool().free_all_blocks()
    cp.fft.config.get_plan_cache().clear()
    input("Check nvidia-smi memory usage... Press any key to run again")
Installation
Wheel (pip install cupy-***)
Environment
OS : Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.29
Python Version : 3.8.10
CuPy Version : 10.0.0
CuPy Platform : NVIDIA CUDA
NumPy Version : 1.21.5
SciPy Version : 1.7.3
Cython Build Version : 0.29.24
Cython Runtime Version : None
CUDA Root : /usr/local/cuda
nvcc PATH : None
CUDA Build Version : 11040
CUDA Driver Version : 11040
CUDA Runtime Version : 11040
cuBLAS Version : (available)
cuFFT Version : 10502
cuRAND Version : 10205
cuSOLVER Version : (11, 2, 0)
cuSPARSE Version : (available)
NVRTC Version : (11, 4)
Thrust Version : 101201
CUB Build Version : 101201
Jitify Build Version : 60e9e72
cuDNN Build Version : (not loaded; try `import cupy.cuda.cudnn` first)
cuDNN Version : (not loaded; try `import cupy.cuda.cudnn` first)
NCCL Build Version : 21104
NCCL Runtime Version : 21104
cuTENSOR Version : None
cuSPARSELt Build Version : None
Device 0 Name : Quadro RTX 5000
Device 0 Compute Capability : 75
Device 0 PCI Bus ID : 0000:01:00.0
Additional Information
The runtime is inside the Docker image nvidia/cuda:11.4.2-runtime-ubuntu20.04.
Issue Analytics
- Created 2 years ago
- Comments: 15 (15 by maintainers)
Top GitHub Comments
OK. I tested it.
cupy=10.4.0 runs out of memory with the script above, but your patched branch does not run out of memory. I don't have a real-world test because I already patched my projects to work around this issue. Thanks @leofang! 😄

I tried to test it, but I am having trouble compiling! I keep getting compiler errors (related to syntax, not linking or missing libraries). Maybe I should just clone the conda-forge cupy-feedstock and build in their docker container! 😆