Very slow product for sparse matrices
See original GitHub issue
Description
Matrix product seems very slow for sparse matrices. I'm using two sparse matrices with 99.9% and 95% null entries. I compute the product between the matrices as follows:
- sparse matrix product -> 0.04 seconds
- convert to dense and compute the product -> ~0.00008 seconds~ (see https://github.com/cupy/cupy/issues/6463#issuecomment-1048656629)
As the matrices are highly sparse, I'd expect the sparse product to be faster, and definitely not ~500 times~ slower than the corresponding dense product. I've tried with cupy installed from wheel, and then from source. I've used both cuSPARSELt 0.1.0 and 0.2.0. I've timed with both timeit and the cupyx.profiler benchmark profiler. The results I get are always the same.
Any hint would be much appreciated! Thanks for your help!
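For intuition about when a sparse product can beat a dense one, the arithmetic work of each can be estimated from the sparsity pattern alone. The sketch below is a CPU illustration with SciPy using hypothetical stand-in matrices at roughly the reported densities (not the actual 20newsgroups data); it counts the scalar multiplies a sparse-sparse product needs versus the fixed 2·m·k·n flops of a dense GEMM:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical stand-ins at roughly the reported densities
# (0.1% and 5% non-zeros); shapes are illustrative only.
x = sp.random(2000, 1000, density=0.001, format="csr", random_state=0)
y = sp.random(20, 2000, density=0.05, format="csr", random_state=0)

# Dense GEMM cost is 2 * m * k * n flops regardless of sparsity.
m, k = y.shape
_, n = x.shape
dense_flops = 2 * m * k * n

# A sparse-sparse product performs one multiply per pair
# (nonzero in column j of y, nonzero in row j of x):
flops_sparse = int(np.dot(y.getnnz(axis=0), x.getnnz(axis=1)))

print(f"dense flops ~ {dense_flops:.2e}, sparse multiplies = {flops_sparse}")
```

The sparse count can be orders of magnitude smaller, but it is irregular work; a GPU runs one big dense GEMM far more efficiently per flop, which is why highly sparse inputs do not automatically make the sparse kernel win.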
To Reproduce
import cupy
from cupyx.profiler import benchmark
from timeit import timeit
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

news = fetch_20newsgroups(subset='train')
x = TfidfVectorizer().fit_transform(news.data)
y = OneHotEncoder().fit_transform(news.target.reshape(-1, 1)).T

x_sparse = cupy.sparse.csr_matrix(x)
y_sparse = cupy.sparse.csr_matrix(y)
x_dense = x_sparse.todense()
y_dense = y_sparse.todense()

n_repeat = 10

def prod(x, y):
    return y @ x

print(benchmark(prod, (x_sparse, y_sparse), n_repeat=n_repeat))
print(benchmark(prod, (x_dense, y_dense), n_repeat=n_repeat))

print(timeit(lambda: prod(x_sparse, y_sparse), number=n_repeat) / n_repeat)
print(timeit(lambda: prod(x_dense, y_dense), number=n_repeat) / n_repeat)
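As a point of reference, the same comparison can be run on CPU with SciPy (an analog using synthetic stand-in matrices at similar densities, not the CuPy code path). At these densities the CPU sparse product typically wins, which is part of what makes the GPU result above surprising:

```python
from timeit import timeit
import scipy.sparse as sp

# Hypothetical shapes at roughly the reported densities,
# not the 20newsgroups matrices.
x = sp.random(5000, 2000, density=0.001, format="csr", random_state=0)
y = sp.random(20, 5000, density=0.05, format="csr", random_state=0)
xd, yd = x.toarray(), y.toarray()

n_repeat = 10
t_sparse = timeit(lambda: y @ x, number=n_repeat) / n_repeat
t_dense = timeit(lambda: yd @ xd, number=n_repeat) / n_repeat
print(f"sparse: {t_sparse:.6f} s, dense: {t_dense:.6f} s")
```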
Installation
Source (pip install cupy)
Environment
OS : Linux-3.10.0-1160.53.1.el7.x86_64-x86_64-with-glibc2.17
Python Version : 3.8.12
CuPy Version : 10.1.0
CuPy Platform : NVIDIA CUDA
NumPy Version : 1.21.2
SciPy Version : 1.6.2
Cython Build Version : 0.29.27
Cython Runtime Version : 0.29.27
CUDA Root : /usr/local/cuda
nvcc PATH : /usr/local/cuda/bin/nvcc --compiler-bindir gcc
CUDA Build Version : 11050
CUDA Driver Version : 11050
CUDA Runtime Version : 11050
cuBLAS Version : (available)
cuFFT Version : 10600
cuRAND Version : 10207
cuSOLVER Version : (11, 3, 2)
cuSPARSE Version : (available)
NVRTC Version : (11, 5)
Thrust Version : 101301
CUB Build Version : 101301
Jitify Build Version : <unknown>
cuDNN Build Version : 8302
cuDNN Version : 8302
NCCL Build Version : 21104
NCCL Runtime Version : 21104
cuTENSOR Version : 10400
cuSPARSELt Build Version : 100
Device 0 Name : NVIDIA A100-SXM4-40GB
Device 0 Compute Capability : 80
Device 0 PCI Bus ID : 0000:00:04.0
Additional Information
No response
Issue Analytics
- State:
- Created 2 years ago
- Comments: 7 (4 by maintainers)
Top GitHub Comments
Hi @eguidotti
I ran your reproducer on a V100 GPU and got the following results.
It looks like you compared CPU time, but in this case I think it is better to compare GPU time. In GPU time, it is 409 ms for sparse matrices and 306 ms for dense matrices. Computing the product with sparse matrices is still slower, though.
What are the sizes of the matrices? If the number of non-null entries is not big enough, you will not see any benefit.
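One way to answer the size question is to print each operand's shape, stored non-zeros, and density before timing. A minimal sketch, using a hypothetical stand-in matrix (in the reproducer you would pass x_sparse and y_sparse instead; CuPy sparse matrices expose the same .shape and .nnz attributes):

```python
import scipy.sparse as sp

def describe(name, m):
    # Shape, stored non-zeros, and density -- the numbers that decide
    # whether a sparse kernel can pay off.
    rows, cols = m.shape
    print(f"{name}: shape={m.shape}, nnz={m.nnz}, "
          f"density={m.nnz / (rows * cols):.4%}")

# Hypothetical stand-in matrix for illustration.
x = sp.random(1000, 500, density=0.001, format="csr", random_state=0)
describe("x", x)
```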
Sorry, I did not realize that you were using a publicly available dataset, I will try to profile this and get back to you.