Hitting a precision limit in `cupy.sparse.csr_matrix` when using `multiply`.
I’ve come across a bug in `cupy.sparse.csr_matrix` which seems to happen only some of the time. I have created a minimal working example below. The example is quite contrived so that the error occurs every time, and the matrix in it is actually fully dense. I am hitting the same bug in my research, where it doesn’t happen consistently, and in those cases I am using a genuinely sparse matrix.
In the MWE we create a huge dense matrix and multiply it by a vector. The `multiply` method actually converts the `csr_matrix` into a `coo_matrix`. This works fine on the first call, but on the second call it breaks with the error message below.
It seems that the `col` array hits the precision limit for an `int32` and then contains values that are either extremely large and negative or extremely large and positive, which triggers the error. However, the array is not large enough that it should hit that limit.
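To make the size argument concrete, here is a rough back-of-the-envelope check using only the shapes from the reproducer. It needs no GPU and says nothing about where the bad values come from; it only shows that the number of stored entries (104,857,600) is about 5% of what a signed 32-bit index can hold. The variable names are illustrative.

```python
# Sizes taken from the reproducer: cp.ones((2048**2, 25)) stored as CSR.
INT32_MAX = 2**31 - 1      # largest value a signed 32-bit index can hold

n_rows = 2048**2           # rows of the matrix
n_cols = 25                # columns of the matrix
nnz = n_rows * n_cols      # every entry is 1, so every entry is stored

print(f"rows           = {n_rows:,}")
print(f"stored entries = {nnz:,}")
print(f"int32 max      = {INT32_MAX:,}")
print(f"fraction used  = {nnz / INT32_MAX:.3f}")
```

Valid column indices only range from 0 to 24, and even the total entry count stays far below the `int32` maximum, so an overflow caused purely by the array size looks unlikely.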
- Conditions (you can just paste the output of `python -c 'import cupy; cupy.show_config()'`)
OS : Linux-4.12.14-122.80.1.20210720-nasa-x86_64-with-glibc2.10
Python Version : 3.8.5
CuPy Version : 9.4.0
CuPy Platform : NVIDIA CUDA
NumPy Version : 1.19.2
SciPy Version : 1.7.1
Cython Build Version : 0.29.24
Cython Runtime Version : None
CUDA Root : /home4/jimartin/.conda/envs/gpu_hack
nvcc PATH : None
CUDA Build Version : 11000
CUDA Driver Version : 11000
CUDA Runtime Version : 11000
cuBLAS Version : (available)
cuFFT Version : 10201
cuRAND Version : 10201
cuSOLVER Version : (10, 6, 0)
cuSPARSE Version : (available)
NVRTC Version : (11, 0)
Thrust Version : 100909
CUB Build Version : 100909
Jitify Build Version : <unknown>
cuDNN Build Version : None
cuDNN Version : None
NCCL Build Version : None
NCCL Runtime Version : None
cuTENSOR Version : None
cuSPARSELt Build Version : None
Device 0 Name : Tesla V100-SXM2-32GB
Device 0 Compute Capability : 70
Device 0 PCI Bus ID : 0000:16:00.0
- Code to reproduce
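from cupy import sparse
import cupy as cp

def run():
    # Fully dense 2048**2 x 25 matrix stored in CSR format
    A = sparse.csr_matrix(cp.ones((2048**2, 25)))
    # Element-wise multiply by a column vector (goes through sum_duplicates -> tocoo)
    return A.multiply(cp.random.normal(size=2048**2)[:, None])

run()  # first call succeeds
run()  # second call raises ValueError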
- Error messages, stack traces, or logs
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-ec9775ede022> in <module>
----> 1 run()

<ipython-input-1-1d6522e0f920> in run()
      4 def run():
      5     A = sparse.csr_matrix(cp.ones((2048**2, 25)))
----> 6     return A.multiply(cp.random.normal(size=2048**2)[:, None])

~/.conda/envs/gpu_hack/lib/python3.8/site-packages/cupyx/scipy/sparse/csr.py in multiply(self, other)
    321     return multiply_by_scalar(self, other)
    322 elif _util.isdense(other):
--> 323     self.sum_duplicates()
    324     other = cupy.atleast_2d(other)
    325     return multiply_by_dense(self, other)

~/.conda/envs/gpu_hack/lib/python3.8/site-packages/cupyx/scipy/sparse/compressed.py in sum_duplicates(self)
    776 # TODO(leofang): add a kernel for compressed sparse matrices without
    777 # converting to coo
--> 778 coo = self.tocoo()
    779 coo.sum_duplicates()
    780 self.__init__(coo.asformat(self.format))

~/.conda/envs/gpu_hack/lib/python3.8/site-packages/cupyx/scipy/sparse/csr.py in tocoo(self, copy)
    436 indices = self.indices
    437
--> 438 return cusparse.csr2coo(self, data, indices)
    439
    440 def tocsc(self, copy=False):

~/.conda/envs/gpu_hack/lib/python3.8/site-packages/cupy/cusparse.py in csr2coo(x, data, indices)
    928     _cusparse.CUSPARSE_INDEX_BASE_ZERO)
    929 # data and indices did not need to be copied already
--> 930 return cupyx.scipy.sparse.coo_matrix(
    931     (data, (row, indices)), shape=x.shape)
    932

~/.conda/envs/gpu_hack/lib/python3.8/site-packages/cupyx/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
    157     raise ValueError('row index exceeds matrix dimensions')
    158 if col.max() >= shape[1]:
--> 159     raise ValueError('column index exceeds matrix dimensions')
    160 if row.min() < 0:
    161     raise ValueError('negative row index found')

ValueError: column index exceeds matrix dimensions
Top GitHub Comments
@christinahedges I tested the example code on the Perlmutter system at NERSC using the cudatoolkit modules available on that system. I can confirm this fails in 11.0 and seems to work starting in 11.2 (there does not seem to be a cudatoolkit module compatible with 11.1).
I’ve confirmed that `cupy-cuda110==9.4.0` and `cupy-cuda111==9.4.0` reproduce the error, while `cupy-cuda112==9.4.0` doesn’t.
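For anyone who wants to check their own environment against these reports, here is a minimal sketch. The helper `reported_status` is hypothetical (it is not part of CuPy) and encodes only what was reported in this thread: failures with CUDA 11.0 and 11.1, and no failure from 11.2 onward. It takes the integer form of the CUDA runtime version shown in the `show_config()` output above (for example `11000` for CUDA 11.0).

```python
def reported_status(runtime_version: int) -> str:
    """Classify a CUDA runtime version integer using only the reports in this thread.

    Hypothetical helper, not a CuPy API. The integer encodes the version as
    1000 * major + 10 * minor, e.g. 11000 for CUDA 11.0 and 11020 for CUDA 11.2.
    """
    major = runtime_version // 1000
    minor = (runtime_version % 1000) // 10
    if (major, minor) in {(11, 0), (11, 1)}:
        return "reported failing"
    if (major, minor) >= (11, 2):
        # Assumption: "seems to work starting in 11.2" is taken to cover later versions.
        return "reported working"
    return "not reported in this thread"


# Example with the three versions mentioned in the comments above
for version in (11000, 11010, 11020):
    print(version, reported_status(version))
```

With CuPy installed, `cupy.cuda.runtime.runtimeGetVersion()` returns this integer for the current environment, so its result can be passed to the helper directly.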