
Sparse Dask Arrays


I briefly ran into a situation with @moody-marlin where sparse arrays would have been useful. I also expect this to be of use for upcoming medium-term work, so I thought I’d put down some thoughts and encourage discussion.

First, to narrow scope: there are several kinds of sparse array layouts. Many arrays that arise from structured meshes are probably not within near-term scope. What is approachable, however, is a blocked sparse array that has more-or-less the same blocked layout as current dask.arrays but swaps out numpy.ndarrays for sparse arrays like scipy.sparse.csr_matrix.

I think that there are two ways to approach this (though I welcome others):

  1. We generalize the current dask.array code a bit so that it uses generic functions like lambda x, y: x.dot(y) rather than np.dot (a rough sketch of this follows the list). This enables more integration and probably less code duplication/effort, but it probably also increases complexity, and thus maintenance cost, on the common code path.
  2. We make a separate dask.SparseArray collection. We would have to recreate a lot of the metadata handling in a new sparse_atop function (satop?), but I suspect that almost all relevant sparse array operations could be written easily with that one operation. The scope for sparse matrices is much smaller.
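To make option 1 concrete, here is a rough sketch (built on dask.delayed rather than dask.array internals; the names and the fixed 2x2 block layout are illustrative) of a blocked matrix multiply that only touches blocks through a generic dot, so the same graph works whether the blocks are numpy.ndarrays or scipy.sparse matrices:

import numpy as np
import scipy.sparse
from dask import delayed

# Generic per-block operations: duck-typed .dot and +, not np.dot.
dot = delayed(lambda x, y: x.dot(y))
add = delayed(lambda x, y: x + y)

def blocked_matmul(a_blocks, b_blocks):
    # Multiply a 2x2 grid of (i, k) blocks by a 2x2 grid of (k, j) blocks.
    return [[add(dot(a_blocks[i][0], b_blocks[0][j]),
                 dot(a_blocks[i][1], b_blocks[1][j]))
             for j in range(2)]
            for i in range(2)]

# The same blocked graph runs with dense or with sparse blocks.
dense = [[np.random.rand(3, 3) for _ in range(2)] for _ in range(2)]
sp = [[scipy.sparse.random(3, 3, density=0.2, format='csr') for _ in range(2)]
      for _ in range(2)]
dense_block = blocked_matmul(dense, dense)[0][0].compute()   # numpy.ndarray
sparse_block = blocked_matmul(sp, sp)[0][0].compute()        # csr_matrix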

If the goal is strictly 2d sparse matrices, then I think that option 2 is a clear-ish winner. There are enough special cases in what scipy.sparse matrices support that I think trying to tie them onto the current dask.array code would be unpleasant long-term.

However, there is some (potentially funded) interest in building a general tensor factorization library on top of dask.array. This would motivate multi-dimensional sparse arrays. We would need to build an in-memory version of this first, which we could probably tailor to fit nicely into the requirements expected by relevant dask.array operations without much code bloat. I suspect that we could go a long way by only generalizing elementwise operations, reductions, tensordot, transpose, and maybe slicing.
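As a rough illustration of why that small set of operations might be enough (a toy dense NumPy computation with made-up sizes, not a proposed API): a basic least-squares matrix-factorization loss and its gradient can be written with nothing but tensordot, elementwise arithmetic, and a reduction.

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(6, 4)    # data
U = rng.rand(6, 2)    # factors for X ~= U @ V.T
V = rng.rand(4, 2)

R = X - np.tensordot(U, V, axes=((1,), (1,)))         # residual: a tensordot
loss = (R ** 2).sum()                                  # elementwise op + reduction
grad_U = -2 * np.tensordot(R, V, axes=((1,), (0,)))   # gradient: another tensordot

A sparse multidimensional array that supported those few operations could slot into the same pattern.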

Comments, criticism, and general thoughts welcome. cc @siebert @jcrist @pitrou @shoyer @eriknw

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 17 (13 by maintainers)

Top GitHub Comments

2 reactions
mrocklin commented, Apr 16, 2017

OK, I have a small multidimensional sparse array library that handles transpose, reshape, tensordot, and simple reductions. It mostly generalizes scipy.sparse.coo_matrix and does clever reshapings to accomplish tensordot by backing it off of scipy.sparse.csr_matrix.dot. The reductions satisfy the numpy interface (keepdims works as usual).

https://github.com/mrocklin/sparse
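The reshape trick described above, reducing a tensordot over chosen axes to a single 2-D dot by moving the contracted axes to the inside and flattening, can be sketched for dense NumPy arrays roughly like this (an illustration of the idea, not the library's actual code):

import numpy as np

def tensordot_via_dot(a, b, axes):
    a_axes, b_axes = axes
    a_keep = [ax for ax in range(a.ndim) if ax not in a_axes]
    b_keep = [ax for ax in range(b.ndim) if ax not in b_axes]
    # Move the kept axes of `a` to the front and its contracted axes to the
    # back (and the reverse for `b`), then flatten each to 2-D so one plain
    # dot performs the whole contraction.
    at = a.transpose(a_keep + list(a_axes)).reshape(
        int(np.prod([a.shape[ax] for ax in a_keep])), -1)
    bt = b.transpose(list(b_axes) + b_keep).reshape(
        -1, int(np.prod([b.shape[ax] for ax in b_keep])))
    out = at.dot(bt)   # with sparse blocks this would be csr_matrix.dot
    return out.reshape([a.shape[ax] for ax in a_keep] +
                       [b.shape[ax] for ax in b_keep])

x = np.random.rand(2, 3, 4)
y = np.random.rand(3, 4, 5)
assert np.allclose(tensordot_via_dot(x, y, ((1, 2), (0, 1))),
                   np.tensordot(x, y, axes=((1, 2), (0, 1))))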

In [1]: import numpy as np, random, sparse

In [2]: def random_x(shape, dtype=float):
   ...:     x = np.zeros(shape=shape, dtype=dtype)
   ...:     for i in range(max(5, np.prod(x.shape) // 3)):
   ...:         x[tuple(random.randint(0, d - 1) for d in x.shape)] = random.randint(0, 100)
   ...:     return x
   ...: 

In [3]: x = random_x((2, 3, 4))

In [4]: y = random_x((3, 4, 5))

In [5]: x
Out[5]: 
array([[[  0.,   0.,   0.,   0.],
        [ 34.,  63.,   0.,   0.],
        [ 79.,   0.,  13.,   0.]],

       [[  0.,  77.,  93.,   0.],
        [ 79.,   0.,   0.,   0.],
        [  0.,  62.,   0.,   0.]]])

In [6]: y
Out[6]: 
array([[[  0.,  52.,   0.,   0.,   0.],
        [ 28.,  36.,   0.,   0.,   0.],
        [  0.,   0.,   0.,  13.,   0.],
        [  0.,   0.,   6.,  25.,  64.]],

       [[  0.,   0.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   0.,  24.],
        [  0.,   0.,   0.,   0.,   6.],
        [  0.,   0.,   0.,   0.,  17.]],

       [[  0.,  46.,   0.,  59.,   0.],
        [  0.,  83.,   0.,  62.,  23.],
        [ 94.,   0.,   0.,  25.,   0.],
        [  0.,   0.,  62.,   0.,   0.]]])

In [7]: xx = COO.from_numpy(x)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-0fae8213abde> in <module>()
----> 1 xx = COO.from_numpy(x)

NameError: name 'COO' is not defined

In [8]: xx = sparse.COO.from_numpy(x)

In [9]: yy = sparse.COO.from_numpy(y)

In [10]: xx
Out[10]: <COO: shape=(2, 3, 4), dtype=float64, nnz=8>

In [11]: yy
Out[11]: <COO: shape=(3, 4, 5), dtype=float64, nnz=18>

In [12]: sparse.tensordot(xx, yy, axes=((1, 2), (0, 1)))
Out[12]: <COO: shape=(2, 5), dtype=float64, nnz=8>

In [13]: np.array(_)
Out[13]: 
array([[ 1222.,  3634.,     0.,  4986.,  1512.],
       [ 2156.,  7918.,     0.,  5053.,  1426.]])

In [14]: xx.sum(axis=(0, 2))
Out[14]: <COO: shape=(3,), dtype=float64, nnz=3>

In [15]: np.array(_)
Out[15]: array([ 170.,  176.,  154.])

In [16]: xx.sum(axis=(0, 2), keepdims=True)
Out[16]: <COO: shape=(1, 3, 1), dtype=float64, nnz=3>

In [17]: np.array(_)
Out[17]: 
array([[[ 170.],
        [ 176.],
        [ 154.]]])

I did most of what I perceive to be the challenging things. However, many simple things still don’t work:

  1. Arithmetic
  2. Elemwise ufuncs (this should be easy for someone who understands the protocol here; see the xfailed tests, and the zero-preserving sketch after this list)
  3. Tensordot/binary operators with other arrays (we need some binary-dispatch solution)
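For the common case where the operation maps zero to zero, an elementwise ufunc on a COO-style array only needs to touch the stored values. Here is a rough sketch of that zero-preserving case, with a hypothetical helper and scipy.sparse.coo_matrix standing in for the multidimensional container:

import numpy as np
import scipy.sparse

def elemwise_unary(func, x):
    # Only valid when func(0) == 0, so implicit zeros stay zero and only the
    # stored entries need to be computed.
    if func(0) != 0:
        raise ValueError("func must map 0 -> 0 to keep the result sparse")
    out = x.copy()
    out.data = func(out.data)
    return out

x = scipy.sparse.random(4, 5, density=0.3, format='coo')
assert np.allclose(elemwise_unary(np.sin, x).toarray(), np.sin(x.toarray()))

Ufuncs that do not preserve zero (np.exp, for example) would densify the result and need different handling.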

Review of https://github.com/mrocklin/sparse would be appreciated.

0 reactions
perimosocordiae commented, Apr 18, 2017

This makes three sparse ndarray-like libraries (that I know of):

  • my sparray: linearized+sorted coordinate format (sketched below)
  • @ev-br’s sparr: C++ map backend (similar to scipy’s DOK format)
  • @mrocklin’s sparse: separate coordinate format

All three attempt to follow the numpy array API where possible, with varying degrees of coverage. Perhaps we should try to pull them into one package, to avoid duplicated effort?
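For reference, here is a rough sketch of what a linearized-and-sorted coordinate layout looks like (an illustrative class, not sparray's actual internals): N-d indices are flattened to one integer key per stored entry and kept sorted, so point lookup becomes a binary search.

import numpy as np

class LinearizedCOO:
    def __init__(self, coords, data, shape):
        # Flatten N-d coordinates to 1-d keys (C order) and sort by key.
        flat = np.ravel_multi_index(coords, shape)
        order = np.argsort(flat)
        self.flat = flat[order]
        self.data = np.asarray(data)[order]
        self.shape = shape

    def __getitem__(self, idx):
        key = np.ravel_multi_index(idx, self.shape)
        pos = np.searchsorted(self.flat, key)
        if pos < len(self.flat) and self.flat[pos] == key:
            return self.data[pos]
        return 0.0

a = LinearizedCOO(coords=([0, 1, 1], [2, 0, 3]), data=[5.0, 7.0, 9.0], shape=(2, 4))
assert a[1, 0] == 7.0 and a[0, 0] == 0.0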

