
Sparse matrix multiply orders of magnitude slower than PyTorch

See original GitHub issue

I have been doing some profiling comparing CuPy's sparse matrix dot() with PyTorch's mm(), and I'm getting some very surprising results. I noticed that while CuPy is using cuSPARSE, PyTorch appears to be using only Thrust/CUB to do the multiply.

Strangely, I'm finding that it's the cuSPARSE csrgemm_kernel on the CuPy side that dominates the runtime of the multiply, taking 40 ms. The end-to-end matrix multiply takes upwards of 70 ms for CuPy, while the PyTorch implementation takes a little over 2 ms end-to-end.

I'm attaching the nvvp file with my profiling results; I'm pretty surprised. The algorithm being profiled is a simple Multinomial Naive Bayes on the 20-newsgroups dataset from scikit-learn. I trained both algorithms twice in order to eliminate JIT and CUDA context creation from the comparison.

>>> cupy.show_config()
CuPy Version          : 6.5.0
CUDA Root             : /usr/local/cuda
CUDA Build Version    : 10000
CUDA Driver Version   : 10000
CUDA Runtime Version  : 10000
cuDNN Build Version   : 7600
cuDNN Version         : 7600
NCCL Build Version    : 2406
NCCL Runtime Version  : 2406
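
For reference, here is a minimal sketch of the kind of comparison the report describes. The shapes, densities, and use of randomly generated matrices are assumptions on my part, not the reporter's actual script: the CuPy side times a CSR-by-CSR product (the path that goes through cuSPARSE's csrgemm), while the PyTorch side times its sparse-COO-by-dense torch.sparse.mm path.

import time

import numpy as np
import scipy.sparse as sp
import cupy as cp
import cupyx.scipy.sparse as cusparse
import torch

# Hypothetical matrices, loosely sized like a 20-newsgroups term matrix.
a_host = sp.random(11314, 130107, density=1e-3, format="csr", dtype=np.float32)
b_host = sp.random(130107, 128, density=1e-2, format="csr", dtype=np.float32)

def time_gpu(fn, sync):
    fn()            # warm-up run to exclude JIT and context creation
    sync()
    t0 = time.perf_counter()
    fn()
    sync()
    return time.perf_counter() - t0

# CuPy: CSR.dot(CSR) -> cuSPARSE csrgemm
a_cp = cusparse.csr_matrix(a_host)
b_cp = cusparse.csr_matrix(b_host)
print("cupy csr.dot(csr): %.4f s" %
      time_gpu(lambda: a_cp.dot(b_cp), cp.cuda.Device().synchronize))

# PyTorch: sparse COO x dense -> torch.sparse.mm
a_coo = a_host.tocoo()
idx = torch.tensor(np.vstack([a_coo.row, a_coo.col]), dtype=torch.long)
a_pt = torch.sparse_coo_tensor(idx, torch.tensor(a_coo.data),
                               a_coo.shape, device="cuda")
b_pt = torch.tensor(b_host.toarray(), device="cuda")
print("torch.sparse.mm:   %.4f s" %
      time_gpu(lambda: torch.sparse.mm(a_pt, b_pt), torch.cuda.synchronize))

The warm-up call before each timed run mirrors the "trained both algorithms twice" note above, so JIT compilation and CUDA context creation stay out of the measured time.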

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

2 reactions
smarkesini commented, Dec 17, 2019

Sorry that I opened a different issue for SpMV, but in #2790 I attached a modification to cupyx/scipy/sparse/coo.py that improved the timing for SpMV with 10M nnz from 0.06 to 0.00024 seconds. It could be related.
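
For context, an SpMV of that size can be timed in a few lines. This is a hypothetical sketch of how one might measure it (it is not the patch attached to #2790), using a randomly generated COO matrix with roughly 10M nonzeros:

import time

import cupy as cp
import cupyx.scipy.sparse as cusparse

n = 100_000
a = cusparse.random(n, n, density=1e-3, format="coo", dtype=cp.float32)  # ~10M nnz
x = cp.random.rand(n, dtype=cp.float32)

a.dot(x)                          # warm-up (kernel compilation, handle setup)
cp.cuda.Device().synchronize()
t0 = time.perf_counter()
y = a.dot(x)
cp.cuda.Device().synchronize()
print("COO SpMV: %.5f s" % (time.perf_counter() - t0))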

0 reactions
leofang commented, Nov 26, 2019

Thanks, I didn't pay attention to the original post and jumped into the middle of the thread. Could you share your CuPy script so we can test it locally ourselves? It would offer a different view from an nvvp session.

Read more comments on GitHub >

Top Results From Across the Web

  • Block Sparse Matrices for Smaller and Faster Language Models: This is a huge improvement on PyTorch sparse matrices: their current implementation is an order of magnitude slower than the dense one.
  • Sparse multiplication slow? - autograd - PyTorch Forums: I'm working on a project where I have multiplications Sparse x dense, where the sparse matrix is fixed (same case as the post...
  • Pytorch Memory efficient implementation of indexed matrix ...: I want to calculate the following memory efficiently: ... We can use bmm for this but it is an order of magnitude slower...
  • cuSPARSE - NVIDIA Documentation Center: The API reference guide for cuSPARSE, the CUDA sparse matrix library. ... Given the fact that the transpose operation y=L^T*x is 10x...
  • 1 Matrix algebra versus linear algebra: using special graph structures or matrix shapes is key to ... least an order of magnitude slower than a well-tuned matrix multiplication.
