Sparse matrix multiply orders of magnitude slower than PyTorch
I have been doing some profiling comparing CuPy's sparse matrix dot() with PyTorch's mm(), and I'm getting some very surprising results. I noticed that while CuPy is using cuSPARSE, PyTorch appears to be using only Thrust/CUB to do the multiply.
Strangely, I'm finding that it's the cuSPARSE csrgemm_kernel from CuPy that is dominating the runtime of the multiply, taking 40ms. The end-to-end matrix multiply takes upwards of 70ms with CuPy, while the PyTorch implementation takes a little over 2ms end-to-end.
I'm attaching the nvvp file with my profiling results. I'm pretty surprised. The algorithm being profiled is a simple Multinomial Naive Bayes using the 20-newsgroups dataset from scikit-learn. I trained both algorithms twice in order to eliminate JIT and CUDA context creation from the comparison.
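For reference, a minimal standalone benchmark along these lines would look roughly like the sketch below. This is not the original script: the shapes, density, random data, and the choice of a sparse-times-dense multiply are assumptions made for illustration. The warm-up call and explicit synchronization are there so the timing excludes JIT, context creation, and in-flight kernels.

```python
import time

import numpy as np
import scipy.sparse as sp
import cupy as cp
import cupyx.scipy.sparse as cpsp
import torch

# Assumed, illustrative sizes (not the 20-newsgroups dimensions).
rows, cols, out_dim, density = 10_000, 100_000, 20, 0.001

# Build one random CSR matrix on the host and reuse it for both libraries.
a_host = sp.random(rows, cols, density=density, format="csr", dtype=np.float32)
b_host = np.random.rand(cols, out_dim).astype(np.float32)

# --- CuPy: CSR x dense ---
a_cp = cpsp.csr_matrix(a_host)
b_cp = cp.asarray(b_host)
a_cp.dot(b_cp)                        # warm-up
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
a_cp.dot(b_cp)
cp.cuda.Stream.null.synchronize()     # wait for the kernel before stopping the clock
print("cupy dot():", time.perf_counter() - t0, "s")

# --- PyTorch: sparse COO x dense ---
coo = a_host.tocoo()
a_t = torch.sparse_coo_tensor(
    torch.tensor(np.vstack([coo.row, coo.col]), dtype=torch.int64),
    torch.tensor(coo.data),
    size=(rows, cols), device="cuda")
b_t = torch.as_tensor(b_host, device="cuda")
torch.sparse.mm(a_t, b_t)             # warm-up
torch.cuda.synchronize()
t0 = time.perf_counter()
torch.sparse.mm(a_t, b_t)
torch.cuda.synchronize()
print("torch.sparse.mm():", time.perf_counter() - t0, "s")
```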
>>> cupy.show_config()
CuPy Version : 6.5.0
CUDA Root : /usr/local/cuda
CUDA Build Version : 10000
CUDA Driver Version : 10000
CUDA Runtime Version : 10000
cuDNN Build Version : 7600
cuDNN Version : 7600
NCCL Build Version : 2406
NCCL Runtime Version : 2406
Top GitHub Comments
Sorry that I opened a different issue for SpMV, but in #2790 I attached a modification to cupyx/scipy/sparse/coo.py that improved the timing for SpMV with 10M nnz from 0.06 to 0.00024 seconds. It could be related.
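For context, a quick way to reproduce an SpMV timing of that order is a sketch like the following (hedged: the square shape and density are assumptions chosen to give roughly 10M nonzeros, not the exact benchmark from #2790).

```python
import time

import cupy as cp
import cupyx.scipy.sparse as cpsp

n, density = 100_000, 0.001           # ~10M nonzeros (assumed shape/density)
a = cpsp.random(n, n, density=density, format="coo", dtype=cp.float32)
x = cp.random.rand(n, dtype=cp.float32)

a.dot(x)                              # warm-up
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
y = a.dot(x)
cp.cuda.Stream.null.synchronize()     # ensure the kernel has finished before timing
print("COO SpMV:", time.perf_counter() - t0, "s")
```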
Thanks, I didn't pay attention to the original post and jumped into the middle of the thread. Could you share your cupy script so we can test it locally ourselves? It'd offer a different view from an nvvp session.