CuPy streams do not support concurrent GPU operation (cp.linalg.svd)
CuPy streams do not appear to support concurrent GPU operation for cp.linalg.svd. (This is my first time using CuPy; I am trying to run concurrent SVDs on the GPU over a stack of matrices.)
Copying @mrocklin and @seibert, as they seem to have spent a lot of time on similar issues.
Related Links
For background on this, see:
- #1695: How to get concurrency from CuPy streams. (Closed with a change to support longer operations, but that does not seem to apply in this case.)
- dask#4040: Issues getting dask.delayed to work concurrently
- Example, map_reduce: link
- Example, concurrency in CuPy with streams: link
- Related Stack Overflow question (suggests trying gesvdjBatched, in C++)
Conditions:
```
CuPy Version          : 7.2.0
CUDA Root             : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0
CUDA Build Version    : 10010
CUDA Driver Version   : 10010
CUDA Runtime Version  : 10010
cuBLAS Version        : 10200
cuFFT Version         : 10010
cuRAND Version        : 10010
cuSOLVER Version      : (10, 1, 0)
cuSPARSE Version      : 10010
NVRTC Version         : (10, 1)
cuDNN Build Version   : 7605
cuDNN Version         : 7605
NCCL Build Version    : None
NCCL Runtime Version  : None
```
Code to reproduce [edit: fixed a bug and a couple of bad comments]
```python
import time
import cupy as cp
import numpy as np
import dask
import dask.array as da


def many_svd_np_vs_cp():
    device = cp.cuda.Device()
    N = 16    # number of desired SVDs, grouped
    M = 1024  # size of each matrix for SVD (M x M)
    A = np.asarray(np.random.randn(N, M, M), dtype=np.float32)

    # ----- Prime the pump, to eliminate CUDA overhead in timings. -----
    A_gpu = cp.asarray(A)
    for i in range(16):
        sg = cp.linalg.svd(A_gpu[0], compute_uv=False)
    time.sleep(0.25)  # to separate this, in nvvp

    # ----- Grouped SVDs in numpy -----
    tm = time.time()
    s_npall = np.linalg.svd(A, compute_uv=False)  # shape (N, M)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Numpy', elaps))

    # ----- Cupy-Loop: grouped SVDs in cupy -----
    sg_all = cp.asarray([])
    tm = time.time()
    for i in range(A_gpu.shape[0]):
        sg = cp.linalg.svd(A_gpu[i], compute_uv=False)
        sg_all = cp.concatenate((sg_all, sg), axis=0)  # flattened to N*M values, but that's OK
    s_cpall = cp.asnumpy(sg_all)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Loop', elaps))
    time.sleep(0.20)

    # ----- Cupy-ListComp: is a list comprehension faster? -----
    sg_all = cp.asarray([])
    tm = time.time()
    sg_all = [cp.linalg.svd(A_gpu[i], compute_uv=False) for i in range(A_gpu.shape[0])]
    s_cpall = cp.asnumpy(sg_all)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-ListComp', elaps))
    time.sleep(0.20)

    # ----- Cupy-Dask-Delayed: try using dask.delayed for parallelism/concurrency -----
    # TODO: not currently trying to retrieve the results with this example.
    tm = time.time()
    tasks = [dask.delayed(cp.linalg.svd)(A_gpu[i], compute_uv=False) for i in range(A_gpu.shape[0])]
    tasks_list = dask.delayed(list(tasks))
    res = dask.compute(tasks_list)  # does return the N results, each of length M
    device.synchronize()
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Dask-Delayed', elaps))
    time.sleep(0.20)

    # ----- Cupy-Streams: try cupy streams for parallelism/concurrency -----
    # TODO: not currently trying to retrieve the results with this example.
    device = cp.cuda.Device()
    map_streams = [cp.cuda.stream.Stream() for i in range(N)]
    tm = time.time()  # BUG: was start_time = time.time()
    for i, stream in enumerate(map_streams):
        with stream:
            sg = cp.linalg.svd(A_gpu[i], compute_uv=False)
            # This is a little worse:
            # C_gpu = cp.asarray(np.random.randn(M, M), dtype=np.float32)
            # sg = cp.linalg.svd(C_gpu, compute_uv=False)
    device.synchronize()
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Streams', elaps))


if __name__ == "__main__":
    many_svd_np_vs_cp()
```
Output:
```
               Numpy: elaps=2.181430
           Cupy-Loop: elaps=1.396355
       Cupy-ListComp: elaps=1.467271
   Cupy-Dask-Delayed: elaps=1.206578
        Cupy-Streams: elaps=~1.6 (was 3.104342 with the bug)
```
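Wall-clock timing of asynchronous CuPy calls mixes kernel-launch cost, transfers, and GPU work. As a side note, here is a minimal sketch (not from the original post; the variable names are mine) that brackets one of the cases with CUDA events, which measure only what the GPU actually executes:

```python
# Sketch: time the looped SVDs with CUDA events rather than time.time().
import cupy as cp
import numpy as np

N, M = 16, 1024
A_gpu = cp.asarray(np.random.randn(N, M, M), dtype=cp.float32)

start, stop = cp.cuda.Event(), cp.cuda.Event()
start.record()                      # recorded on the current stream
for i in range(N):
    cp.linalg.svd(A_gpu[i], compute_uv=False)
stop.record()
stop.synchronize()                  # block until the recorded work has finished
print('GPU time: %.1f ms' % cp.cuda.get_elapsed_time(start, stop))
```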
NVVP: Running in NVVP, here are two examples of what the stream launches look like, for (N, M) = (64, 256) and (N, M) = (16, 1024), where N is the number of SVDs/streams and the matrices are M x M. Each SVD takes 23 ms (or 116 ms in the second case), which is clearly enough time to try to launch them concurrently. Using C_gpu instead of A_gpu[i] does not make a difference.
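Since a single CPU thread has to issue all of the per-stream launches serially, one variant worth trying is to issue the per-stream SVDs from a small thread pool. The sketch below is mine (not from the issue); it assumes CuPy's current stream is thread-local (as in recent versions) and that the cuSOLVER calls release the GIL, and the kernel-launch bottleneck discussed in the comments below may still dominate:

```python
# Sketch: issue each per-stream SVD from its own worker thread.
from concurrent.futures import ThreadPoolExecutor
import cupy as cp
import numpy as np

N, M = 16, 1024
A_gpu = cp.asarray(np.random.randn(N, M, M), dtype=cp.float32)
streams = [cp.cuda.stream.Stream(non_blocking=True) for _ in range(N)]

def svd_on_stream(i):
    with streams[i]:                         # sets the thread-local current stream
        return cp.linalg.svd(A_gpu[i], compute_uv=False)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(svd_on_stream, range(N)))

cp.cuda.Device().synchronize()               # wait for all streams to finish
```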
Top GitHub Comments
Thank you for benchmarking cusolverDnSgesvdaStridedBatched so quickly, though the result of the 64 x 256x256 case is a bit disappointing… Anyway, I will work on it later.

I ran your repro on a Quadro GV100 with CUDA 10.2. It took about 1.04 sec to compute the 16 x 1024x1024 case, and the number of kernels launched for this computation was nearly 200K. Given that each kernel needs about 4 usec to launch, the total kernel launch time comes to around 0.8 sec. It looks to me like kernel launch by the CPU is the performance limiter in this case…
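That per-launch figure can be sanity-checked on any machine with a rough micro-benchmark like the sketch below (mine, not from the thread); it conflates launch cost with execution of a tiny kernel, so it is only a ballpark number:

```python
# Sketch: rough per-kernel launch cost, by launching many tiny elementwise kernels.
import time
import cupy as cp

x = cp.ones(8, dtype=cp.float32)
cp.cuda.Device().synchronize()

n = 20000
tm = time.time()
for _ in range(n):
    x += 1.0                                 # one tiny kernel launch per iteration
cp.cuda.Device().synchronize()
print('approx per-launch cost: %.2f us' % ((time.time() - tm) / n * 1e6))
```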
@anaruse: OK, I see now, thanks for explaining. I just didn't realize how much of an impact these kernel launches have!
@leofang: I agree, thanks, it seems like a very difficult limitation, especially for this use case.
Here are some more results on cusolverDnSgesvdaStridedBatched and other SVD approaches in MKL and Python. The table shows timings for several different approaches and cases for batched SVDs:
C++:
- (A) MKL, using gebrd and bdsqr
- (B) MKL, using geqrf, gebrd, and bdsqr (recommended by Intel for SVD of tall/skinny matrices)
- (C) CUDA, using gesvdaStridedBatched (gesvda), the 'approximate algorithm'
Python:
- (D) numpy np.linalg.svd (I have numpy with MKL installed)
- (E) cupy list comprehension [cp.linalg.svd(A_gpu[i]) …]
- (F) cupy streams
An obvious missing approach is eig(A.T@A) for tall/skinny, but then accuracy, etc. would have to be considered.
Comments:
Some Timing Takeaways (CPU=4790K, GPU=TitanXP):
There's a lot to digest here. I know what I have to do for my application now, but let me know if you have any other questions. As far as the initial issue is concerned, maybe cusolverDnSgesvdaStridedBatched could be considered in CuPy for tall/skinny cases.
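For completeness, the eig(A.T@A) route mentioned above can be tried in a batched way with plain CuPy calls. The sketch below is mine, assumes a CuPy version whose linalg.eigvalsh accepts stacked matrices (otherwise loop over the batch), and squares the condition number, so accuracy has to be weighed against speed:

```python
# Sketch: singular values via eigenvalues of the Gram matrices A[i].T @ A[i].
import cupy as cp
import numpy as np

N, M = 16, 1024
A_gpu = cp.asarray(np.random.randn(N, M, M), dtype=cp.float32)

G = cp.matmul(A_gpu.transpose(0, 2, 1), A_gpu)   # batched Gram matrices, shape (N, M, M)
w = cp.linalg.eigvalsh(G)                        # eigenvalues, ascending, shape (N, M)
s = cp.sqrt(cp.clip(w, 0, None))[:, ::-1]        # singular values, descending
cp.cuda.Device().synchronize()
```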