How to optimize DaCe SpMV? (unoptimized version 20x slower than SciPy sparse dot)
Problem description
In the DaCe paper, it is stated that DaCe SpMV is as fast as MKL:
We observe similar results in SpMV, which is more complicated to optimize due to its irregular memory access characteristics. SDFGs are on par with MKL (99.9% performance) on CPU, and are successfully vectorized on GPUs.
However, I found it 20x slower than scipy.sparse.csr_matrix.dot: 1.4 s vs. 60 ms for the problem size used in the DaCe paper.
- Full code to reproduce: https://gist.github.com/learning-chip/1ef56f6ea707b063c3177e9f143f0905
- The SpMV code was taken from https://github.com/spcl/dace/blob/v0.10.8/samples/simple/spmv.py
- DaCe version: 0.10.8
- Hardware: Intel Xeon 8180 CPU
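For reference, the SciPy baseline can be timed with a sketch like the following (the matrix size and density here are illustrative assumptions, not the exact values from the gist):

```python
import time

import numpy as np
import scipy.sparse as sp

# Assumed problem size for illustration; the gist uses its own values.
n, density = 8192, 0.01
A = sp.random(n, n, density=density, format="csr",
              dtype=np.float32, random_state=0)
x = np.random.default_rng(0).standard_normal(n).astype(np.float32)

A.dot(x)  # warm-up call, excluded from the measurement
t0 = time.perf_counter()
b = A.dot(x)
elapsed = time.perf_counter() - t0
print(f"SciPy CSR SpMV: {elapsed * 1e3:.2f} ms")
```

Timing only the second call avoids mixing one-time setup cost into the kernel measurement, which matters for the comparison below.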
It is probably because I did not apply any transformations to optimize performance. But I could not find more words in the paper about SpMV optimization, except this short paragraph:
Using explicit dataflow is beneficial when defining nontrivial data accesses. Fig. 4 depicts a full implementation of Sparse Matrix-Vector multiplication (SpMV). In the implementation, the access x[A_col[j]] is translated into an indirect access subgraph (see Appendix F) that can be identified and used in transformations.
I also tried this sample code from the paper, slightly different from the GitHub version, but got TypeError: dtype must be a DaCe type, got __map_8_b0 at runtime.
# From DaCe paper fig. 4
@dace.program
def spmv(A_row: dace.uint32[H + 1], A_col: dace.uint32[nnz],
         A_val: dace.float32[nnz], x: dace.float32[W],
         b: dace.float32[H]):
    for i in dace.map[0:H]:
        for j in dace.map[A_row[i]:A_row[i + 1]]:
            with dace.tasklet:
                a << A_val[j]
                in_x << x[A_col[j]]
                out >> b(1, dace.sum)[i]
                out = a * in_x
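The loop nest above is plain CSR SpMV; a NumPy sketch of the same computation (a hypothetical reference helper, for illustration only) makes the indirect access x[A_col[j]] explicit:

```python
import numpy as np

def spmv_csr(A_row, A_col, A_val, x):
    """Reference CSR SpMV with the same loop nest as the DaCe program."""
    b = np.zeros(len(A_row) - 1, dtype=A_val.dtype)
    for i in range(len(A_row) - 1):              # outer map: one row per iteration
        for j in range(A_row[i], A_row[i + 1]):  # inner map: this row's nonzeros
            b[i] += A_val[j] * x[A_col[j]]       # indirect access into x
    return b
```

The inner loop bounds depend on data (A_row), which is the irregular-access characteristic the paper refers to.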
Environment
Conda environment.yml:
name: dace
dependencies:
- python=3.7.5
- pip
- pip:
- jupyterlab
- scipy==1.6.2
- dace==0.10.8
Issue Analytics
- State:
- Created 2 years ago
- Comments: 9 (2 by maintainers)
Top GitHub Comments
@learning-chip Thanks for providing the reproducible notebook. It made everything easier 👍
First of all, what you were timing includes both compilation and run time. This has been fixed since the latest release (it should be part of the upcoming v0.10.9, which also adds Python 3.9 support). If you run the code through Binder with the latest master branch, the time goes down to 16.4 ms, though the first run will be slower because it compiles the code once.
If you want to stay on the current 0.10.8 release, you can also precompile the SDFG yourself:
The second part of the question concerns transformations. If I recall correctly, the transformations we applied were tiling the inner map and vectorization.
@sancierra @alexnick83 Wonderful, thanks for the detailed replies! Let me take a closer look.