How to optimize DaCe SpMV? (unoptimized version 20x slower than SciPy sparse dot)
Problem description
In the DaCe paper, it is stated that DaCe SpMV is as fast as MKL:
We observe similar results in SpMV, which is more complicated to optimize due to its irregular memory access characteristics. SDFGs are on par with MKL (99.9% performance) on CPU, and are successfully vectorized on GPUs.
However, I found it 20x slower than scipy.sparse.csr_matrix.dot: 1.4 s vs. 60 ms for the problem size used in the DaCe paper.
- Full code to reproduce: https://gist.github.com/learning-chip/1ef56f6ea707b063c3177e9f143f0905
- The SpMV code was taken from https://github.com/spcl/dace/blob/v0.10.8/samples/simple/spmv.py
- DaCe version: 0.10.8
- Hardware: Intel Xeon 8180 CPU
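For reference, the SciPy baseline can be timed with a sketch like the following (the matrix size and density here are illustrative assumptions, not the exact values from the gist):

```python
import time

import numpy as np
import scipy.sparse as sp

# Assumed problem size for illustration; the gist uses its own values.
n, density = 8192, 0.01
A = sp.random(n, n, density=density, format="csr",
              dtype=np.float32, random_state=0)
x = np.random.default_rng(0).standard_normal(n).astype(np.float32)

A.dot(x)  # warm-up call, excluded from the measurement
t0 = time.perf_counter()
b = A.dot(x)
elapsed = time.perf_counter() - t0
print(f"SciPy CSR SpMV: {elapsed * 1e3:.2f} ms")
```

Timing only the second call avoids mixing one-time setup cost into the kernel measurement, which matters for the comparison below.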
It is probably because I did not apply any transformations to optimize performance. But I could not find more words in the paper about SpMV optimization, except this short paragraph:
Using explicit dataflow is beneficial when defining nontrivial data accesses. Fig. 4 depicts a full implementation of Sparse Matrix-Vector multiplication (SpMV). In the implementation, the access x[A_col[j]] is translated into an indirect access subgraph (see Appendix F) that can be identified and used in transformations.
I also tried this sample code from the paper, slightly different from the GitHub version, but got TypeError: dtype must be a DaCe type, got __map_8_b0 at runtime.
# From DaCe paper fig. 4
@dace.program
def spmv(A_row: dace.uint32[H + 1], A_col: dace.uint32[nnz],
         A_val: dace.float32[nnz], x: dace.float32[W],
         b: dace.float32[H]):
    for i in dace.map[0:H]:
        for j in dace.map[A_row[i]:A_row[i + 1]]:
            with dace.tasklet:
                a << A_val[j]
                in_x << x[A_col[j]]
                out >> b(1, dace.sum)[i]
                out = a * in_x
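The loop nest above is plain CSR SpMV; a NumPy sketch of the same computation (a hypothetical reference helper, for illustration only) makes the indirect access x[A_col[j]] explicit:

```python
import numpy as np

def spmv_csr(A_row, A_col, A_val, x):
    """Reference CSR SpMV with the same loop nest as the DaCe program."""
    b = np.zeros(len(A_row) - 1, dtype=A_val.dtype)
    for i in range(len(A_row) - 1):              # outer map: one row per iteration
        for j in range(A_row[i], A_row[i + 1]):  # inner map: this row's nonzeros
            b[i] += A_val[j] * x[A_col[j]]       # indirect access into x
    return b
```

The inner loop bounds depend on data (A_row), which is the irregular-access characteristic the paper refers to.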
Environment
Conda environment.yml:
name: dace
dependencies:
- python=3.7.5
- pip
- pip:
- jupyterlab
- scipy==1.6.2
- dace==0.10.8
Issue Analytics
- State:
- Created 2 years ago
- Comments: 9 (2 by maintainers)
Top GitHub Comments
@learning-chip Thanks for providing the reproducible notebook. It made everything easier 👍
First of all, what you were timing includes both compilation and run time. This has been fixed since the latest release (it should be part of the upcoming v0.10.9, which also adds Python 3.9 support). If you run the code through Binder with the latest master branch, the time goes down to 16.4 ms, though the first run will be slower because it compiles the code once.
If you want to stay on the current 0.10.8 release, you can also precompile the SDFG yourself:
The second part of the question concerns transformations. If I recall correctly, the transformations we applied were tiling the inner map and vectorization.
@sancierra @alexnick83 Wonderful, thanks for the detailed replies! Let me take a closer look.