Very slow product for sparse matrices
See original GitHub issue
Description
Matrix product seems very slow for sparse matrices. I'm using two sparse matrices with 99.9% and 95% null entries. I compute the product between the matrices as follows:
- sparse matrix product -> 0.04 seconds
- convert to dense and compute the product -> ~0.00008 seconds~ (see https://github.com/cupy/cupy/issues/6463#issuecomment-1048656629)
As the matrices are highly sparse, I'd expect the sparse product to be faster, and definitely not ~500 times~ slower than the corresponding dense product. I've tried with cupy installed from wheel, and then from source. I've used both cuSPARSELt 0.1.0 and 0.2.0. I've timed with both timeit and the cupyx.profiler benchmark profiler. The results I get are always the same.
Any hint would be much appreciated! Thanks for your help!
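For intuition about when a sparse product can beat a dense one, the arithmetic work of each can be estimated from the sparsity pattern alone. The sketch below is a CPU illustration with SciPy using hypothetical stand-in matrices at roughly the reported densities (not the actual 20newsgroups data); it counts the scalar multiplies a sparse-sparse product needs versus the fixed 2·m·k·n flops of a dense GEMM:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical stand-ins at roughly the reported densities
# (0.1% and 5% non-zeros); shapes are illustrative only.
x = sp.random(2000, 1000, density=0.001, format="csr", random_state=0)
y = sp.random(20, 2000, density=0.05, format="csr", random_state=0)

# Dense GEMM cost is 2 * m * k * n flops regardless of sparsity.
m, k = y.shape
_, n = x.shape
dense_flops = 2 * m * k * n

# A sparse-sparse product performs one multiply per pair
# (nonzero in column j of y, nonzero in row j of x):
flops_sparse = int(np.dot(y.getnnz(axis=0), x.getnnz(axis=1)))

print(f"dense flops ~ {dense_flops:.2e}, sparse multiplies = {flops_sparse}")
```

The sparse count can be orders of magnitude smaller, but it is irregular work; a GPU runs one big dense GEMM far more efficiently per flop, which is why highly sparse inputs do not automatically make the sparse kernel win.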
To Reproduce
import cupy
from cupyx.profiler import benchmark
from timeit import timeit
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

news = fetch_20newsgroups(subset='train')
x = TfidfVectorizer().fit_transform(news.data)
y = OneHotEncoder().fit_transform(news.target.reshape(-1, 1)).T

x_sparse = cupy.sparse.csr_matrix(x)
y_sparse = cupy.sparse.csr_matrix(y)
x_dense = x_sparse.todense()
y_dense = y_sparse.todense()

n_repeat = 10

def prod(x, y):
    return y @ x

print(benchmark(prod, (x_sparse, y_sparse), n_repeat=n_repeat))
print(benchmark(prod, (x_dense, y_dense), n_repeat=n_repeat))

print(timeit(lambda: prod(x_sparse, y_sparse), number=n_repeat) / n_repeat)
print(timeit(lambda: prod(x_dense, y_dense), number=n_repeat) / n_repeat)
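As a point of reference, the same comparison can be run on CPU with SciPy (an analog using synthetic stand-in matrices at similar densities, not the CuPy code path). At these densities the CPU sparse product typically wins, which is part of what makes the GPU result above surprising:

```python
from timeit import timeit
import scipy.sparse as sp

# Hypothetical shapes at roughly the reported densities,
# not the 20newsgroups matrices.
x = sp.random(5000, 2000, density=0.001, format="csr", random_state=0)
y = sp.random(20, 5000, density=0.05, format="csr", random_state=0)
xd, yd = x.toarray(), y.toarray()

n_repeat = 10
t_sparse = timeit(lambda: y @ x, number=n_repeat) / n_repeat
t_dense = timeit(lambda: yd @ xd, number=n_repeat) / n_repeat
print(f"sparse: {t_sparse:.6f} s, dense: {t_dense:.6f} s")
```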
Installation
Source (pip install cupy)
Environment
OS : Linux-3.10.0-1160.53.1.el7.x86_64-x86_64-with-glibc2.17
Python Version : 3.8.12
CuPy Version : 10.1.0
CuPy Platform : NVIDIA CUDA
NumPy Version : 1.21.2
SciPy Version : 1.6.2
Cython Build Version : 0.29.27
Cython Runtime Version : 0.29.27
CUDA Root : /usr/local/cuda
nvcc PATH : /usr/local/cuda/bin/nvcc --compiler-bindir gcc
CUDA Build Version : 11050
CUDA Driver Version : 11050
CUDA Runtime Version : 11050
cuBLAS Version : (available)
cuFFT Version : 10600
cuRAND Version : 10207
cuSOLVER Version : (11, 3, 2)
cuSPARSE Version : (available)
NVRTC Version : (11, 5)
Thrust Version : 101301
CUB Build Version : 101301
Jitify Build Version : <unknown>
cuDNN Build Version : 8302
cuDNN Version : 8302
NCCL Build Version : 21104
NCCL Runtime Version : 21104
cuTENSOR Version : 10400
cuSPARSELt Build Version : 100
Device 0 Name : NVIDIA A100-SXM4-40GB
Device 0 Compute Capability : 80
Device 0 PCI Bus ID : 0000:00:04.0
Additional Information
No response
Issue Analytics
- State:
- Created 2 years ago
- Comments: 7 (4 by maintainers)
Top GitHub Comments
Hi @eguidotti
I ran your reproducer on a V100 GPU and got the following results.
It looks like you compared CPU time, but in this case I think it is better to compare GPU time. In GPU time, it is 409 ms for sparse matrices and 306 ms for dense matrices. Computing the product with sparse matrices is still slower, though.
What are the sizes of the matrices? If the number of non-null entries is not big enough, you will not see any benefit.
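One way to answer the size question is to print each operand's shape, stored non-zeros, and density before timing. A minimal sketch, using a hypothetical stand-in matrix (in the reproducer you would pass x_sparse and y_sparse instead; CuPy sparse matrices expose the same .shape and .nnz attributes):

```python
import scipy.sparse as sp

def describe(name, m):
    # Shape, stored non-zeros, and density -- the numbers that decide
    # whether a sparse kernel can pay off.
    rows, cols = m.shape
    print(f"{name}: shape={m.shape}, nnz={m.nnz}, "
          f"density={m.nnz / (rows * cols):.4%}")

# Hypothetical stand-in matrix for illustration.
x = sp.random(1000, 500, density=0.001, format="csr", random_state=0)
describe("x", x)
```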
Sorry, I did not realize that you were using a publicly available dataset, I will try to profile this and get back to you.