possible bug in cupyx sparse matrix transpose/multiplication
Hello cupy(x) developers,
I would like to report what looks like a bug in the cupyx sparse matrix library. I am running on a Skylake CPU and a Tesla V100 GPU on a SUSE-ish Linux system, with CuPy version 6.0.0 and CUDA version 10.1.168, inside a conda environment which you can find in my requirements.txt file:
Here is a reproducer in which I compare the same operations on the CPU and GPU, checking at every step that they agree. In the last step the np.allclose() check fails: the max_iCov_diff value is much larger than machine precision, which suggests there is a bug.
import numpy as np
import scipy.sparse
import cupy as cp
import cupyx as cpx

m = n = 1000
np.random.seed(1)

# create random data to use in W
random_data = np.random.random(m)

# create A, a random sparse matrix
A_cpu = scipy.sparse.random(m, n, format='csr', random_state=42)
A_gpu = cpx.scipy.sparse.csr_matrix(A_cpu)
# yank gpu back and compare
A_yank = A_gpu.get()
assert np.allclose(A_cpu.todense(), A_yank.todense())

# create W, a random diagonal sparse matrix
W_cpu = scipy.sparse.spdiags(data=random_data, diags=[0], m=m, n=n)
# cupyx operates on device arrays, so move the data over first
W_gpu = cpx.scipy.sparse.spdiags(data=cp.asarray(random_data), diags=[0], m=m, n=n)
# yank gpu back and compare
W_yank = W_gpu.get()
assert np.allclose(W_cpu.todense(), W_yank.todense())

# see how the dot products go
W_dot_A_cpu = W_cpu.dot(A_cpu)
W_dot_A_gpu = W_gpu.dot(A_gpu)
# yank gpu back and compare
W_dot_A_yank = W_dot_A_gpu.get()
assert np.allclose(W_dot_A_cpu.todense(), W_dot_A_yank.todense())

# check the transpose
A_trans_cpu = A_cpu.T
A_trans_gpu = A_gpu.T
# yank gpu back and compare
A_trans_yank = A_trans_gpu.get()  # use get() because it's a sparse object
assert np.allclose(A_trans_cpu.todense(), A_trans_yank.todense())

# okay, now the inverse covariance (where things go wrong)
iCov_cpu = A_cpu.T.dot(W_dot_A_cpu)
iCov_gpu = A_gpu.T.dot(W_dot_A_gpu)
# yank gpu back and compare
iCov_yank = iCov_gpu.get()
iCov_diff = np.abs(iCov_cpu.todense() - iCov_yank.todense())
max_iCov_diff = np.max(iCov_diff)
print("max iCov diff")
print(max_iCov_diff)
assert np.allclose(iCov_cpu.todense(), iCov_yank.todense())  # fails for large matrix sizes
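When a CPU/GPU mismatch like this shows up, a dense NumPy reference helps establish which side is actually wrong. A minimal CPU-only sketch, assuming the same construction as the reproducer but at a smaller size so the dense computation stays cheap:

```python
import numpy as np
import scipy.sparse

m = n = 200  # smaller than the reproducer so the dense reference is cheap
random_data = np.random.RandomState(1).random(m)
A = scipy.sparse.random(m, n, format='csr', random_state=42)
W = scipy.sparse.spdiags(data=random_data, diags=[0], m=m, n=n)

# dense ground truth for A.T @ W @ A, computed with plain NumPy
A_dense = A.toarray()
ref = A_dense.T @ (W.toarray() @ A_dense)

# sparse result, same formula as the reproducer
iCov = A.T.dot(W.dot(A))
assert np.allclose(iCov.toarray(), ref)
```

Running the same comparison against the `.get()` of the cupyx result would then show directly whether the GPU product deviates from the dense ground truth.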
If there is more information I can provide please let me know. Of course if I am doing something wrong I would also be happy for your feedback about how to do this correctly.
Thank you very much for your help, Laurie
Issue Analytics
- Created: 4 years ago
- Comments: 9 (2 by maintainers)
Top GitHub Comments
Hi @econtal, I think so. Following the definition there, coo2csr converts the array containing the uncompressed row indices (corresponding to COO format) into an array of compressed row pointers (corresponding to CSR format). From the CUDA documentation: “Sparse matrices in CSR format are assumed to be stored in row-major CSR format; in other words, the index arrays are first sorted by row indices and then, within the same row, by column indices. It is assumed that each pair of row and column indices appears only once.”
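To illustrate the “appears only once” caveat: COO input may legitimately contain duplicate (row, col) entries, and a canonicalizing conversion must sum them. A small scipy sketch of the semantics cupyx is expected to match:

```python
import numpy as np
import scipy.sparse

# two entries land on the same (row, col) position -- legal in COO format
rows = np.array([0, 0, 1])
cols = np.array([2, 2, 1])
vals = np.array([1.0, 2.0, 5.0])
coo = scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3))
print(coo.nnz)  # 3: duplicates are stored as-is in COO

csr = coo.tocsr()  # the COO -> CSR conversion sums duplicates
print(csr.nnz)     # 2: the duplicate pair has been merged into one entry
assert csr[0, 2] == 3.0
```

If a CSR-producing routine skips this summing step, downstream products computed from the non-canonical matrix can silently diverge.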
Anyway, I ran some tests using the attached code, generating a random (1M by 1M) matrix with 0.1G non-zeros, including duplicates, and computing the same product in different ways. The results are within numerical precision of the CPU implementation and of each other, but the timing is also interesting:
- without the modification, cupy left multiply (1st time): 3.2422685334458947 s
- with the modification, cupy left multiply (1st time): 0.013027168810367584 s

Besides, performing the same computation either as S*x (left multiply) or (xT*ST).T (right multiply), where S is in CSR and ST in CSC, can change speed quite dramatically on the GPU:
- cupy rm (1st and 2nd time): 3.269 s, 3.2578 s
- cupy lm (1st and 2nd time): 3.24 s, 0.00015796 s
- cupy lm with modified code (1st and 2nd time): 0.01377 s, 0.0001685973 s

For reference (single-threaded scipy):
- scipy lm: 0.2227 s; scipy rm: 0.22039 s
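The two formulations being timed are mathematically equivalent; a scipy sketch of the identity (timings omitted here — the GPU numbers above come from the attached script):

```python
import numpy as np
import scipy.sparse

rng = np.random.RandomState(0)
S = scipy.sparse.random(500, 500, density=0.01, format='csr', random_state=0)
x = rng.random((500, 4))

lm = np.asarray(S @ x)         # left multiply: S (CSR) times x
St = S.T.tocsc()               # the transpose, stored in CSC
rm = np.asarray((x.T @ St).T)  # right multiply: (x.T @ S.T).T

# both formulations must agree to numerical precision
assert np.allclose(lm, rm)
```

On the CPU the two paths give identical answers; the point of the timings above is that the GPU kernels backing each path can differ enormously in speed even though the math is the same.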
sparse_test.py.txt
@econtal I manually applied your patch on cupy 6.2.0 with CUDA 10.0, and the dot product result looks good. Thanks a lot for fixing this!
Personally, I would suggest explicitly adding a test case for the transposed dot product. But I am not really familiar with the cupy test setup, so I will leave it to the cupy maintainers to decide.
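For what it's worth, such a test could be written generically over the CSR constructor, so the same check runs against scipy on the host and, in the cupy suite, against cupyx.scipy.sparse. The function name and structure below are hypothetical sketches, and here the check is only exercised with scipy:

```python
import numpy as np
import scipy.sparse

def check_transposed_dot(csr_matrix, m=300, seed=42):
    """Hypothetical regression check: A.T.dot(W.dot(A)) must match scipy.

    csr_matrix is any scipy-compatible CSR constructor, e.g.
    scipy.sparse.csr_matrix or cupyx.scipy.sparse.csr_matrix.
    """
    A_np = scipy.sparse.random(m, m, density=0.05, format='csr', random_state=seed)
    diag = np.random.RandomState(seed).random(m)
    W_np = scipy.sparse.spdiags(diag, [0], m, m).tocsr()

    A, W = csr_matrix(A_np), csr_matrix(W_np)
    iCov = A.T.dot(W.dot(A))
    # pull the result back to the host if needed (cupyx sparse objects have .get())
    result = iCov.get() if hasattr(iCov, 'get') else iCov

    ref = A_np.T.dot(W_np.dot(A_np))
    assert np.allclose(np.asarray(result.todense()), np.asarray(ref.todense()))

check_transposed_dot(scipy.sparse.csr_matrix)  # host-side sanity check
```

Passing `cupyx.scipy.sparse.csr_matrix` instead would reproduce the original failure on unpatched versions.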