
How to optimize DaCe SpMV? (unoptimized version 20x slower than SciPy sparse dot)

See original GitHub issue

Problem description

In the DaCe paper, it is stated that DaCe SpMV is as fast as MKL:

We observe similar results in SpMV, which is more complicated to optimize due to its irregular memory access characteristics. SDFGs are on par with MKL (99.9% performance) on CPU, and are successfully vectorized on GPUs.

However, I found it 20x slower than scipy.sparse.csr_matrix.dot: 1.4 s vs. 60 ms for the problem size used in the DaCe paper.
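
For reference, the SciPy side of that comparison is just a CSR matrix-vector product. A minimal, self-contained baseline might look like the following (the matrix size here is illustrative, not necessarily the one from the paper):

```python
import numpy as np
import scipy.sparse as sp

# Illustrative size; the paper's exact H, W, and nnz are not reproduced here
rng = np.random.default_rng(0)
H = W = 1024
A = sp.random(H, W, density=0.01, format="csr",
              dtype=np.float32, random_state=rng)
x = rng.random(W, dtype=np.float32)

b = A.dot(x)  # the scipy.sparse.csr_matrix.dot call being timed
```

Timing this call (e.g. with %timeit in the same notebook) gives the SciPy reference number that DaCe is being compared against.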

This is probably because I did not apply any transformations to optimize performance, but the paper says nothing more about SpMV optimization beyond this short paragraph:

Using explicit dataflow is beneficial when defining nontrivial data accesses. Fig. 4 depicts a full implementation of Sparse Matrix-Vector multiplication (SpMV). In the implementation, the access x[A_col[j]] is translated into an indirect access subgraph (see Appendix F) that can be identified and used in transformations.

I also tried this sample code from the paper (slightly different from the GitHub version), but got TypeError: dtype must be a DaCe type, got __map_8_b0 at runtime.

# From DaCe paper, Fig. 4
@dace.program
def spmv(A_row: dace.uint32[H + 1], A_col: dace.uint32[nnz],
         A_val: dace.float32[nnz], x: dace.float32[W],
         b: dace.float32[H]):
    for i in dace.map[0:H]:
        for j in dace.map[A_row[i]:A_row[i + 1]]:
            with dace.tasklet:
                a << A_val[j]
                in_x << x[A_col[j]]
                out >> b(1, dace.sum)[i]
                out = a * in_x
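
For clarity, the dataflow above is ordinary CSR SpMV: the outer map runs over rows, the inner map over each row's nonzeros, and x is gathered through the column index. A plain NumPy reference (a hypothetical helper, not part of DaCe) that computes the same result:

```python
import numpy as np

def csr_spmv_reference(A_row, A_col, A_val, x, H):
    """Plain-Python CSR SpMV matching what the DaCe maps compute."""
    b = np.zeros(H, dtype=A_val.dtype)
    for i in range(H):                           # outer map over rows
        for j in range(A_row[i], A_row[i + 1]):  # inner map over row i's nonzeros
            b[i] += A_val[j] * x[A_col[j]]       # indirect access x[A_col[j]]
    return b
```

Comparing this function's output against scipy.sparse gives a correctness check that is independent of the SDFG.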

Environment

Conda environment.yml is

name: dace
dependencies:
  - python=3.7.5
  - pip
  - pip:
    - jupyterlab
    - scipy==1.6.2
    - dace==0.10.8

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

tbennun commented on May 13, 2021 (1 reaction)

@learning-chip Thanks for providing the reproducible notebook. It made everything easier 👍

First of all, what you were timing includes both compilation and running time. This has been fixed since the latest release (it should be part of the upcoming v0.10.9, which also includes Python 3.9 support). If you run the code through Binder with the latest master branch, the time goes down to 16.4 ms, though the first run will be slower since it compiles the code once.

If you want to use the current 0.10.8 release, you can also precompile the SDFG yourself:

cspmv = spmv.compile()
%time cspmv(A_row=A_row, A_col=A_col, A_val=A_val, x=x, b=b, H=H, W=W, nnz=nnz)

The second part of the question is the transformations. If I recall correctly, the set of transformations we applied was tiling the internal map and using vectorization to improve the runtime.

learning-chip commented on May 25, 2021 (0 reactions)

@sancierra @alexnick83 Wonderful, thanks for the detailed replies! Let me take a closer look.
