CuPy streams do not support concurrent GPU operation (cp.linalg.svd)
CuPy streams do not appear to support concurrent GPU operation for cp.linalg.svd. (This is my first time using CuPy; I am trying to run concurrent SVDs on the GPU over a stack of matrices.)
Copying @mrocklin and @seibert, as they seem to have spent a lot of time on similar issues.
Related Links
For background on this, see:
- #1695: How to get concurrency from CuPy streams. (Closed with a change to support longer operations, but that does not seem to apply in this case.)
- dask#4040: Issues getting dask.delayed to work concurrently
- Example, map_reduce: link
- Example, concurrency in CuPy with streams: link
- Related Stack Overflow question (suggests trying gesvdjBatched, in C++)
Conditions:
```
CuPy Version          : 7.2.0
CUDA Root             : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0
CUDA Build Version    : 10010
CUDA Driver Version   : 10010
CUDA Runtime Version  : 10010
cuBLAS Version        : 10200
cuFFT Version         : 10010
cuRAND Version        : 10010
cuSOLVER Version      : (10, 1, 0)
cuSPARSE Version      : 10010
NVRTC Version         : (10, 1)
cuDNN Build Version   : 7605
cuDNN Version         : 7605
NCCL Build Version    : None
NCCL Runtime Version  : None
```
Code to reproduce [edit: fixed a bug and a couple of bad comments]
```python
import time
import cupy as cp
import numpy as np
import dask
import dask.array as da


def many_svd_np_vs_cp():
    device = cp.cuda.Device()
    N = 16    # number of desired SVDs, grouped
    M = 1024  # size of each matrix for SVD (M x M)
    A = np.asarray(np.random.randn(N, M, M), dtype=np.float32)

    # ----- Prime the pump, to eliminate CUDA overhead in timings. -----
    A_gpu = cp.asarray(A)
    for i in range(16):
        sg = cp.linalg.svd(A_gpu[0], compute_uv=False)
    time.sleep(0.25)  # to separate this, in nvvp

    # ----- Grouped SVDs in numpy -----
    tm = time.time()
    s_npall = np.linalg.svd(A, compute_uv=False)  # shape (N, M)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Numpy', elaps))

    # ----- Cupy-Loop: grouped SVDs in cupy -----
    sg_all = cp.asarray([])
    tm = time.time()
    for i in range(A_gpu.shape[0]):
        sg = cp.linalg.svd(A_gpu[i], compute_uv=False)
        sg_all = cp.concatenate((sg_all, sg), axis=0)  # flattened to N*M values, but that's OK
    s_cpall = cp.asnumpy(sg_all)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Loop', elaps))
    time.sleep(0.20)

    # ----- Cupy-ListComp: is a list comprehension faster? -----
    sg_all = cp.asarray([])
    tm = time.time()
    sg_all = [cp.linalg.svd(A_gpu[i], compute_uv=False) for i in range(A_gpu.shape[0])]
    s_cpall = cp.asnumpy(sg_all)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-ListComp', elaps))
    time.sleep(0.20)

    # ----- Cupy-Dask-Delayed: try using dask.delayed for parallelism/concurrency -----
    # TODO: not currently trying to retrieve the results with this example.
    tm = time.time()
    tasks = [dask.delayed(cp.linalg.svd)(A_gpu[i], compute_uv=False) for i in range(A_gpu.shape[0])]
    tasks_list = dask.delayed(list(tasks))
    res = dask.compute(tasks_list)  # does return the N results, each of length M
    device.synchronize()
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Dask-Delayed', elaps))
    time.sleep(0.20)

    # ----- Cupy-Streams: try cupy streams for parallelism/concurrency -----
    # TODO: not currently trying to retrieve the results with this example.
    device = cp.cuda.Device()
    map_streams = [cp.cuda.stream.Stream() for i in range(N)]
    tm = time.time()  # BUG: was start_time = time.time()
    for i, stream in enumerate(map_streams):
        with stream:
            sg = cp.linalg.svd(A_gpu[i], compute_uv=False)
            # This is a little worse:
            # C_gpu = cp.asarray(np.random.randn(M, M), dtype=np.float32)
            # sg = cp.linalg.svd(C_gpu, compute_uv=False)
    device.synchronize()
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Streams', elaps))


if __name__ == "__main__":
    many_svd_np_vs_cp()
```
Output:
```
               Numpy: elaps=2.181430
           Cupy-Loop: elaps=1.396355
       Cupy-ListComp: elaps=1.467271
   Cupy-Dask-Delayed: elaps=1.206578
        Cupy-Streams: elaps=~1.6 (was 3.104342 with the bug)
```
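Wall-clock timing of asynchronous CuPy calls mixes kernel-launch cost, transfers, and GPU work. As a side note, here is a minimal sketch (not from the original post; the variable names are mine) that brackets one of the cases with CUDA events, which measure only what the GPU actually executes:

```python
# Sketch: time the looped SVDs with CUDA events rather than time.time().
import cupy as cp
import numpy as np

N, M = 16, 1024
A_gpu = cp.asarray(np.random.randn(N, M, M), dtype=cp.float32)

start, stop = cp.cuda.Event(), cp.cuda.Event()
start.record()                      # recorded on the current stream
for i in range(N):
    cp.linalg.svd(A_gpu[i], compute_uv=False)
stop.record()
stop.synchronize()                  # block until the recorded work has finished
print('GPU time: %.1f ms' % cp.cuda.get_elapsed_time(start, stop))
```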
NVVP: Running in NVVP, here are two examples of what the stream launches look like, for (N, M) = (64, 256) and (N, M) = (16, 1024), where N is the number of SVDs/streams and the matrices are M x M. Each SVD takes 23 ms (or 116 ms in the second case), which is clearly enough time to try to launch them concurrently. Using C_gpu instead of A_gpu[i] does not make a difference.
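Since a single CPU thread has to issue all of the per-stream launches serially, one variant worth trying is to issue the per-stream SVDs from a small thread pool. The sketch below is mine (not from the issue); it assumes CuPy's current stream is thread-local (as in recent versions) and that the cuSOLVER calls release the GIL, and the kernel-launch bottleneck discussed in the comments below may still dominate:

```python
# Sketch: issue each per-stream SVD from its own worker thread.
from concurrent.futures import ThreadPoolExecutor
import cupy as cp
import numpy as np

N, M = 16, 1024
A_gpu = cp.asarray(np.random.randn(N, M, M), dtype=cp.float32)
streams = [cp.cuda.stream.Stream(non_blocking=True) for _ in range(N)]

def svd_on_stream(i):
    with streams[i]:                         # sets the thread-local current stream
        return cp.linalg.svd(A_gpu[i], compute_uv=False)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(svd_on_stream, range(N)))

cp.cuda.Device().synchronize()               # wait for all streams to finish
```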
Top GitHub Comments
Thank you for benchmarking cusolverDnSgesvdaStridedBatched so quickly, though the result of the 64 x 256x256 case is a bit disappointing… Anyway, I will work on it later.

I ran your repro on a Quadro GV100 with CUDA 10.2. It took about 1.04 sec to compute the 16 x 1024x1024 case, and the number of kernels launched for this computation was nearly 200K. Given that each kernel needs about 4 usec to launch, the total kernel launch time comes to around 0.8 sec. It looks to me like kernel launch by the CPU is the performance limiter in this case…
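That per-launch figure can be sanity-checked on any machine with a rough micro-benchmark like the sketch below (mine, not from the thread); it conflates launch cost with execution of a tiny kernel, so it is only a ballpark number:

```python
# Sketch: rough per-kernel launch cost, by launching many tiny elementwise kernels.
import time
import cupy as cp

x = cp.ones(8, dtype=cp.float32)
cp.cuda.Device().synchronize()

n = 20000
tm = time.time()
for _ in range(n):
    x += 1.0                                 # one tiny kernel launch per iteration
cp.cuda.Device().synchronize()
print('approx per-launch cost: %.2f us' % ((time.time() - tm) / n * 1e6))
```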
@anaruse: OK, I see now, thanks for explaining. I just didn't realize how much of an impact these kernel launches have!
@leofang: I agree, thanks, it seems like a very difficult limitation, especially for this use case.
Here are some more results on cusolverDnSgesvdaStridedBatched and other SVD approaches in MKL and Python. The table shows timings for several different approaches and cases for batched SVDs:
C++:
- (A) MKL, using gebrd and bdsqr
- (B) MKL, using geqrf, gebrd, and bdsqr (recommended by Intel for SVD of tall/skinny matrices)
- (C) CUDA, using gesvdaStridedBatched (gesvda), the 'approximate algorithm'
Python:
- (D) numpy np.linalg.svd (I have numpy with MKL installed)
- (E) cupy list comprehension [cp.linalg.svd(A_gpu[i]) …]
- (F) cupy streams
An obvious missing approach is eig(A.T@A) for tall/skinny, but then accuracy, etc. would have to be considered.
Comments:
Some Timing Takeaways (CPU=4790K, GPU=TitanXP):
There's a lot to digest here. I know what I have to do for my application now, but let me know if you have any other questions. As far as the initial issue is concerned, maybe cusolverDnSgesvdaStridedBatched could be considered in CuPy for tall/skinny cases.
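For completeness, the eig(A.T@A) route mentioned above can be tried in a batched way with plain CuPy calls. The sketch below is mine, assumes a CuPy version whose linalg.eigvalsh accepts stacked matrices (otherwise loop over the batch), and squares the condition number, so accuracy has to be weighed against speed:

```python
# Sketch: singular values via eigenvalues of the Gram matrices A[i].T @ A[i].
import cupy as cp
import numpy as np

N, M = 16, 1024
A_gpu = cp.asarray(np.random.randn(N, M, M), dtype=cp.float32)

G = cp.matmul(A_gpu.transpose(0, 2, 1), A_gpu)   # batched Gram matrices, shape (N, M, M)
w = cp.linalg.eigvalsh(G)                        # eigenvalues, ascending, shape (N, M)
s = cp.sqrt(cp.clip(w, 0, None))[:, ::-1]        # singular values, descending
cp.cuda.Device().synchronize()
```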