About unified memory in CuPy

Hi CuPy team,

Is there any documentation describing which CuPy functions support unified memory?

So far I’ve tested two examples. The first one is a dot product between two large matrices, which worked for me:

import cupy as cp
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)
size = 32768
a = cp.ones((size, size)) # 8GB
b = cp.ones((size, size)) # 8GB
cp.dot(a, b)

The second is a simple SVD test:

import os
import time
import numpy as np

import cupy as cp
from cupy.cuda.memory import malloc_managed

cp.cuda.set_allocator(malloc_managed)

tAccum = 0
x = np.random.random((50000, 10000))
print("MB ", x.nbytes / 1024**2)  # nbytes / 1024 would report KB, not MB

t0 = time.time()
d_x = cp.asarray(x)
t1 = time.time()
dt = t1 - t0
print('H to D transfer ',  dt,  ' sec')

tAccum += dt

t0 = time.time()
d_u, d_s, d_v = cp.linalg.svd(d_x)
t1 = time.time()
dt = t1 - t0
print('SVD ', dt, ' sec')

tAccum += dt

t0 = time.time()
u = cp.asnumpy(d_u)
s = cp.asnumpy(d_s)
v = cp.asnumpy(d_v)
t1 = time.time()
dt = t1 - t0
print('D to H transfer ',  dt, ' sec')

tAccum += dt
print ('Total ', tAccum, ' sec')

which fails with the following error:

Traceback (most recent call last):
  File "svd.py", line 25, in <module>
    d_u, d_s, d_v = cp.linalg.svd(d_x)
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.11_gcc_6.4.0/lib/python3.7/site-packages/cupy-7.1.1-py3.7-linux-ppc64le.egg/cupy/linalg/decomposition.py", line 307, in svd
    buffersize = gesvd_bufferSize(handle, m, n)
  File "cupy/cuda/cusolver.pyx", line 1237, in cupy.cuda.cusolver.dgesvd_bufferSize
  File "cupy/cuda/cusolver.pyx", line 1242, in cupy.cuda.cusolver.dgesvd_bufferSize
  File "cupy/cuda/cusolver.pyx", line 440, in cupy.cuda.cusolver.check_status
cupy.cuda.cusolver.CUSOLVERError: CUSOLVER_STATUS_INVALID_VALUE

We are benchmarking on POWER9 to understand how CuPy behaves with datasets larger than 16 GB. Knowing which CuPy features work with unified memory and which do not would let us progress faster.
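For context, the working sets of the two examples above can be estimated with plain arithmetic. This is a rough sketch, not a measurement: `array_bytes` is a hypothetical helper, it assumes float64 data (the default for `cp.ones` and `np.random.random`), it counts only the arrays themselves and not any workspace the libraries allocate, and it assumes the SVD returns a full square U.

```python
GIB = 1024**3  # bytes per GiB


def array_bytes(shape, itemsize=8):
    """Bytes needed for a dense array of the given shape (float64 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize


# Dot example: two (32768, 32768) inputs plus a result of the same shape.
dot_total = 3 * array_bytes((32768, 32768))

# SVD example: the (50000, 10000) input and a full (50000, 50000) U matrix.
svd_input = array_bytes((50000, 10000))
svd_full_u = array_bytes((50000, 50000))

print(f"dot working set: {dot_total / GIB:.1f} GiB")
print(f"SVD input:       {svd_input / GIB:.2f} GiB")
print(f"SVD full U:      {svd_full_u / GIB:.2f} GiB")
```

Under these assumptions both examples need more than the 16 GB available on a single GPU, which is why they depend on unified memory.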

P.S. According to section 3.6 of this technical report,

https://developer.nvidia.com/sites/default/files/akamai/cuda/files/Misc/mygpu.pdf

unified memory can be used with cuSolver.

System configuration

IBM Power System AC922
2x POWER9 CPU (84 SMT cores each)
512 GB RAM
6x NVIDIA Volta GPU with 16 GB HBM2
GCC 6.4
CUDA 10.1.168
NVIDIA Driver 418.67
CuPy 7.1.1

Thanks,

Benjamin

Top GitHub Comments

anaruse commented, Mar 10, 2020

FYI, response from the cuSolver team:

GESVD checks whether the matrix size exceeds the range of a 32-bit signed integer, because the API only supports 32-bit integers.
In this case, the size of matrix U exceeds 2^31-1.
These constraints do not mean GESVD cannot work for large dimensions; it is simply a condition that we set up for 32-bit signed integers.
We are working on a 64-bit API to resolve this issue.
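The limit described above can be checked by hand for the (50000, 10000) input from the SVD test. The sketch below is only arithmetic: it assumes "size" means the number of elements in each output matrix (the thread does not spell out the exact quantity cuSolver checks), and `svd_output_elements` and `exceeds_int32` are hypothetical helpers, not CuPy or cuSolver functions.

```python
INT32_MAX = 2**31 - 1  # largest value of a 32-bit signed integer


def svd_output_elements(m, n, full_matrices=True):
    """Element counts of U, S and V^T for an SVD of an (m, n) matrix."""
    k = min(m, n)
    u_cols = m if full_matrices else k
    vt_rows = n if full_matrices else k
    return {"U": m * u_cols, "S": k, "Vt": vt_rows * n}


def exceeds_int32(m, n, full_matrices=True):
    """Names of the outputs whose element count does not fit in int32."""
    sizes = svd_output_elements(m, n, full_matrices)
    return [name for name, count in sizes.items() if count > INT32_MAX]


full = svd_output_elements(50000, 10000)
print(full["U"], ">", INT32_MAX, "is", full["U"] > INT32_MAX)
print("full SVD:   ", exceeds_int32(50000, 10000, full_matrices=True))
print("reduced SVD:", exceeds_int32(50000, 10000, full_matrices=False))
```

With a full U of 50000 x 50000 = 2,500,000,000 elements the count exceeds 2^31-1, while a reduced U of 50000 x 10000 elements would stay below it. Whether requesting the reduced form actually avoids the error in CuPy 7.1.1 is not confirmed in this thread.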

benjha commented, Mar 6, 2020

Thank you all for your comments and feedback.

Good to know it is not a problem directly related to how CuPy uses unified memory.

@emcastillo @anaruse @leofang We are testing and benchmarking CuPy and NV Rapids with large memory allocations on the Summit supercomputer, using its production environment. Our ultimate goal is to offer scalable CPU- and GPU-based analytics to our users.
