Sparse matrix multiply orders of magnitude slower than PyTorch
I have been doing some profiling comparing CuPy's sparse matrix dot() with PyTorch's mm(), and I'm getting some very surprising results. I noticed that while CuPy is using cuSPARSE, PyTorch appears to be using only Thrust/CUB to do the multiply.
Strangely, I'm finding that it's the cuSPARSE csrgemm_kernel from CuPy that is dominating the runtime of the multiply, taking 40ms. The end-to-end matrix multiply takes upwards of 70ms with CuPy, while the PyTorch implementation takes a little over 2ms end-to-end.
I'm attaching the nvvp file with my profiling results. I'm pretty surprised. The algorithm being profiled is a simple Multinomial Naive Bayes using the 20-newsgroups dataset from scikit-learn. I trained both algorithms twice in order to eliminate JIT and CUDA context creation from the comparison.
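For reference, a minimal standalone benchmark along these lines would look roughly like the sketch below. This is not the original script: the shapes, density, random data, and the choice of a sparse-times-dense multiply are assumptions made for illustration. The warm-up call and explicit synchronization are there so the timing excludes JIT, context creation, and in-flight kernels.

```python
import time

import numpy as np
import scipy.sparse as sp
import cupy as cp
import cupyx.scipy.sparse as cpsp
import torch

# Assumed, illustrative sizes (not the 20-newsgroups dimensions).
rows, cols, out_dim, density = 10_000, 100_000, 20, 0.001

# Build one random CSR matrix on the host and reuse it for both libraries.
a_host = sp.random(rows, cols, density=density, format="csr", dtype=np.float32)
b_host = np.random.rand(cols, out_dim).astype(np.float32)

# --- CuPy: CSR x dense ---
a_cp = cpsp.csr_matrix(a_host)
b_cp = cp.asarray(b_host)
a_cp.dot(b_cp)                        # warm-up
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
a_cp.dot(b_cp)
cp.cuda.Stream.null.synchronize()     # wait for the kernel before stopping the clock
print("cupy dot():", time.perf_counter() - t0, "s")

# --- PyTorch: sparse COO x dense ---
coo = a_host.tocoo()
a_t = torch.sparse_coo_tensor(
    torch.tensor(np.vstack([coo.row, coo.col]), dtype=torch.int64),
    torch.tensor(coo.data),
    size=(rows, cols), device="cuda")
b_t = torch.as_tensor(b_host, device="cuda")
torch.sparse.mm(a_t, b_t)             # warm-up
torch.cuda.synchronize()
t0 = time.perf_counter()
torch.sparse.mm(a_t, b_t)
torch.cuda.synchronize()
print("torch.sparse.mm():", time.perf_counter() - t0, "s")
```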
>>> cupy.show_config()
CuPy Version : 6.5.0
CUDA Root : /usr/local/cuda
CUDA Build Version : 10000
CUDA Driver Version : 10000
CUDA Runtime Version : 10000
cuDNN Build Version : 7600
cuDNN Version : 7600
NCCL Build Version : 2406
NCCL Runtime Version : 2406
Top GitHub Comments
Sorry that I opened a different issue for SpMV, but in #2790 I attached a modification to cupyx/scipy/sparse/coo.py that improved the timing for SpMV with 10M nnz from 0.06 to 0.00024 seconds. It could be related.
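For context, a quick way to reproduce an SpMV timing of that order is a sketch like the following (hedged: the square shape and density are assumptions chosen to give roughly 10M nonzeros, not the exact benchmark from #2790).

```python
import time

import cupy as cp
import cupyx.scipy.sparse as cpsp

n, density = 100_000, 0.001           # ~10M nonzeros (assumed shape/density)
a = cpsp.random(n, n, density=density, format="coo", dtype=cp.float32)
x = cp.random.rand(n, dtype=cp.float32)

a.dot(x)                              # warm-up
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
y = a.dot(x)
cp.cuda.Stream.null.synchronize()     # ensure the kernel has finished before timing
print("COO SpMV:", time.perf_counter() - t0, "s")
```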
Thanks, I didn't pay attention to the original post and jumped into the middle of the thread. Could you share your cupy script so we can test it locally ourselves? It'd offer a different view from an nvvp session.