Sparse Dask Arrays
I briefly ran into a situation with @moody-marlin where sparse arrays would have been useful. I also expect this to be of use for upcoming medium-term work, so I thought I’d put down some thoughts and encourage discussion.
First, narrowing scope: there are several kinds of sparse array layouts. Many arrays that arise from structured meshes are probably not within near-term scope. What is approachable, however, is a blocked sparse array that has more or less the same blocked layout as current `dask.array`s, but swaps out `numpy.ndarray`s for sparse arrays, like `scipy.sparse.csr_matrix`.
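To make the intended layout concrete, here is a minimal hand-rolled sketch, not `dask.array` itself: the same chunked structure that `dask.array` uses, but with `scipy.sparse.csr_matrix` blocks and `dask.delayed` supplying the laziness. The 2×2 grid, the helper `random_csr`, and the blocked matmul are illustrative assumptions.

```python
# A hand-rolled sketch of the blocked layout described above: a grid of chunks,
# but each chunk is a scipy.sparse.csr_matrix rather than a numpy.ndarray.
import scipy.sparse
import dask

def random_csr(n, density=0.1, seed=0):
    # One sparse block of the larger logical array (illustrative helper)
    return scipy.sparse.random(n, n, density=density, format='csr',
                               random_state=seed)

n = 1000
# 2x2 grids of lazily built sparse blocks, standing in for dask.array chunks
A = [[dask.delayed(random_csr)(n, seed=2 * i + j) for j in range(2)] for i in range(2)]
B = [[dask.delayed(random_csr)(n, seed=10 + 2 * i + j) for j in range(2)] for i in range(2)]

# Blocked matrix multiply: C[i][j] = sum_k A[i][k] @ B[k][j], block by block
C = [[A[i][0].dot(B[0][j]) + A[i][1].dot(B[1][j]) for j in range(2)]
     for i in range(2)]

blocks = dask.compute(C)[0]
print(type(blocks[0][0]))  # each result block is itself a csr_matrix
```

The point is that every per-block operation stays within scipy.sparse; that is the property either of the design options below would need to preserve.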
I think that there are two ways to approach this (though I welcome others):
- We generalize the current `dask.array` code a bit so that we use generic functions like `lambda x, y: x.dot(y)` rather than `np.dot`. This enables more integration and probably less code duplication/effort, but it probably also increases complexity, and thus maintenance, on the common code path. (A minimal duck-typing sketch follows this list.)
- We make a separate `dask.SparseArray` collection. We would have to recreate a lot of the metadata handling in a new `sparse_atop` function (`satop`?), but I suspect that almost all relevant sparse array operations could be written easily with this operation. The scope for sparse matrices is much smaller.
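As a concrete illustration of option 1, a per-block kernel written against duck-typed methods works unchanged for dense and sparse blocks. The helper name `block_dot` is my own, not existing dask code.

```python
# Duck-typing sketch: rely on the block's own .dot rather than np.dot, so the
# same kernel accepts numpy.ndarray blocks and scipy.sparse blocks alike.
import numpy as np
import scipy.sparse

def block_dot(x, y):
    return x.dot(y)

dense_x = np.random.random((5, 5))
dense_y = np.random.random((5, 5))
sparse_x = scipy.sparse.random(5, 5, density=0.4, format='csr')
sparse_y = scipy.sparse.random(5, 5, density=0.4, format='csr')

print(type(block_dot(dense_x, dense_y)))    # numpy.ndarray
print(type(block_dot(sparse_x, sparse_y)))  # scipy.sparse CSR matrix
```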
If the goal is strictly 2-d sparse matrices then I think that option 2 is a clear-ish winner. There are enough special cases in what scipy.sparse matrices support that I think trying to bolt them onto the current `dask.array` code would be unpleasant long-term.
However, there is some (potentially funded) interest in building a general tensor factorization library on top of `dask.array`. This would motivate multi-dimensional sparse arrays. We would need to build an in-memory version of this first, which we could probably tailor to fit nicely into the requirements expected by relevant `dask.array` operations without much code bloat. I suspect that we could go a long way by only generalizing elementwise operations, reductions, `tensordot`, `transpose`, and maybe slicing.
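As a rough sketch of what such an in-memory multi-dimensional sparse container might look like, coordinates-plus-values storage makes transpose, scalar elementwise operations, and full reductions nearly trivial. The class name `NDCOO` and its methods are hypothetical, not an existing API.

```python
# Hypothetical minimal n-dimensional COO container: coordinates + values.
import numpy as np

class NDCOO:
    def __init__(self, coords, data, shape):
        self.coords = np.asarray(coords)   # shape (ndim, nnz), integer indices
        self.data = np.asarray(data)       # shape (nnz,), stored values
        self.shape = tuple(shape)

    def transpose(self, axes):
        # Permuting axes just permutes the coordinate rows and the shape
        return NDCOO(self.coords[list(axes)], self.data,
                     [self.shape[a] for a in axes])

    def multiply_scalar(self, c):
        # Elementwise scaling touches only the stored values
        return NDCOO(self.coords, self.data * c, self.shape)

    def sum(self):
        # Full reduction: implicit zeros contribute nothing
        return self.data.sum()

    def todense(self):
        out = np.zeros(self.shape, dtype=self.data.dtype)
        out[tuple(self.coords)] = self.data
        return out

# A 3-d sparse tensor with two stored entries
x = NDCOO(coords=[[0, 1], [2, 0], [1, 3]], data=[10.0, 20.0], shape=(2, 3, 4))
assert x.todense()[0, 2, 1] == 10.0
assert x.transpose((2, 0, 1)).todense()[1, 0, 2] == 10.0
assert x.multiply_scalar(2).sum() == 60.0
```

Slicing and `tensordot` are the harder parts; the comment below handles the contraction by falling back on 2-d `scipy.sparse` machinery.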
Comments, criticism, and general thoughts welcome. cc @siebert @jcrist @pitrou @shoyer @eriknw
Top GitHub Comments
OK, I have a small multidimensional sparse array library that handles transpose, reshape, tensordot, and simple reductions. It mostly generalizes `scipy.sparse.coo_matrix` and does clever reshapings to accomplish tensordot by backing it off of `scipy.sparse.csr_matrix.dot`. The reductions satisfy the numpy interface (`keepdims` works as usual).

https://github.com/mrocklin/sparse
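For intuition, the reshape trick can be shown in a few lines: contracting the last axis of one tensor against the first axis of another reduces to a single 2-d sparse `.dot` plus reshapes. This is a sketch of the principle only, not the code in https://github.com/mrocklin/sparse (which stays in sparse form rather than densifying as this toy does).

```python
# Illustration of backing an n-d tensordot off of a single 2-d sparse dot.
import numpy as np
import scipy.sparse

def tensordot_via_csr(a, b):
    # Contract a's last axis with b's first axis, i.e. np.tensordot(a, b, axes=1)
    k = a.shape[-1]
    assert b.shape[0] == k
    a2 = scipy.sparse.csr_matrix(a.reshape(-1, k))   # (prod(a.shape[:-1]), k)
    b2 = scipy.sparse.csr_matrix(b.reshape(k, -1))   # (k, prod(b.shape[1:]))
    out2 = a2.dot(b2)                                # one 2-d sparse dot
    return out2.toarray().reshape(a.shape[:-1] + b.shape[1:])

a = np.random.random((2, 3, 4))
b = np.random.random((4, 5, 6))
np.testing.assert_allclose(tensordot_via_csr(a, b), np.tensordot(a, b, axes=1))
```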
I did most of what I perceive to be the challenging things. However, many simple things still don’t work:
Review of https://github.com/mrocklin/sparse would be appreciated.
This makes three sparse ndarray-like libraries (that I know of):
All three attempt to follow the numpy array API where possible, with varying degrees of coverage. Perhaps we should try to pull them into one package, to avoid duplicated effort?