
Sparse Dask Arrays


I briefly ran into a situation with @moody-marlin where sparse arrays would have been useful. I also expect this to be of use for upcoming medium-term work, so I thought I’d put down some thoughts and encourage discussion.

First, to narrow scope: there are several kinds of sparse array layouts. Many arrays that arise from structured meshes are probably not within near-term scope. What is approachable, however, is a blocked sparse array that has more-or-less the same blocked layout as current dask.arrays but swaps out numpy.ndarrays for sparse arrays like scipy.sparse.csr_matrix.

I think that there are two ways to approach this (though I welcome others):

  1. We generalize the current dask.array code a bit so that it uses generic functions like lambda x, y: x.dot(y) rather than np.dot (a rough sketch of this follows the list). This enables more integration and probably less code duplication/effort, but it probably also increases complexity, and thus maintenance cost, on the common code path.
  2. We make a separate dask.SparseArray collection. We would have to recreate a lot of the metadata handling in a new sparse_atop function (satop?), but I suspect that almost all relevant sparse array operations could be written easily with that one operation. The scope for sparse matrices is much smaller.
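To make option 1 concrete, here is a rough sketch (built on dask.delayed rather than dask.array internals; the names and the fixed 2x2 block layout are illustrative) of a blocked matrix multiply that only touches blocks through a generic dot, so the same graph works whether the blocks are numpy.ndarrays or scipy.sparse matrices:

import numpy as np
import scipy.sparse
from dask import delayed

# Generic per-block operations: duck-typed .dot and +, not np.dot.
dot = delayed(lambda x, y: x.dot(y))
add = delayed(lambda x, y: x + y)

def blocked_matmul(a_blocks, b_blocks):
    # Multiply a 2x2 grid of (i, k) blocks by a 2x2 grid of (k, j) blocks.
    return [[add(dot(a_blocks[i][0], b_blocks[0][j]),
                 dot(a_blocks[i][1], b_blocks[1][j]))
             for j in range(2)]
            for i in range(2)]

# The same blocked graph runs with dense or with sparse blocks.
dense = [[np.random.rand(3, 3) for _ in range(2)] for _ in range(2)]
sp = [[scipy.sparse.random(3, 3, density=0.2, format='csr') for _ in range(2)]
      for _ in range(2)]
dense_block = blocked_matmul(dense, dense)[0][0].compute()   # numpy.ndarray
sparse_block = blocked_matmul(sp, sp)[0][0].compute()        # csr_matrix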

If the goal is strictly 2d sparse matrices, then I think that option 2 is a clear-ish winner. There are enough special cases in what scipy.sparse matrices support that I think trying to tie them onto the current dask.array code would be unpleasant long-term.

However, there is some (potentially funded) interest in building a general tensor factorization library on top of dask.array. This would motivate multi-dimensional sparse arrays. We would need to build an in-memory version of this first, which we could probably tailor to fit nicely into the requirements expected by relevant dask.array operations without much code bloat. I suspect that we could go a long way by only generalizing elementwise operations, reductions, tensordot, transpose, and maybe slicing.
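As a rough illustration of why that small set of operations might be enough (a toy dense NumPy computation with made-up sizes, not a proposed API): a basic least-squares matrix-factorization loss and its gradient can be written with nothing but tensordot, elementwise arithmetic, and a reduction.

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(6, 4)    # data
U = rng.rand(6, 2)    # factors for X ~= U @ V.T
V = rng.rand(4, 2)

R = X - np.tensordot(U, V, axes=((1,), (1,)))         # residual: a tensordot
loss = (R ** 2).sum()                                  # elementwise op + reduction
grad_U = -2 * np.tensordot(R, V, axes=((1,), (0,)))   # gradient: another tensordot

A sparse multidimensional array that supported those few operations could slot into the same pattern.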

Comments, criticism, and general thoughts welcome. cc @siebert @jcrist @pitrou @shoyer @eriknw

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 17 (13 by maintainers)

Top GitHub Comments

2 reactions
mrocklin commented, Apr 16, 2017

OK, I have a small multidimensional sparse array library that handles transpose, reshape, tensordot, and simple reductions. It mostly generalizes scipy.sparse.coo_matrix and does clever reshapings to accomplish tensordot by backing it off of scipy.sparse.csr_matrix.dot. The reductions satisfy the numpy interface (keepdims works as usual).

https://github.com/mrocklin/sparse
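The reshape trick described above, reducing a tensordot over chosen axes to a single 2-D dot by moving the contracted axes to the inside and flattening, can be sketched for dense NumPy arrays roughly like this (an illustration of the idea, not the library's actual code):

import numpy as np

def tensordot_via_dot(a, b, axes):
    a_axes, b_axes = axes
    a_keep = [ax for ax in range(a.ndim) if ax not in a_axes]
    b_keep = [ax for ax in range(b.ndim) if ax not in b_axes]
    # Move the kept axes of `a` to the front and its contracted axes to the
    # back (and the reverse for `b`), then flatten each to 2-D so one plain
    # dot performs the whole contraction.
    at = a.transpose(a_keep + list(a_axes)).reshape(
        int(np.prod([a.shape[ax] for ax in a_keep])), -1)
    bt = b.transpose(list(b_axes) + b_keep).reshape(
        -1, int(np.prod([b.shape[ax] for ax in b_keep])))
    out = at.dot(bt)   # with sparse blocks this would be csr_matrix.dot
    return out.reshape([a.shape[ax] for ax in a_keep] +
                       [b.shape[ax] for ax in b_keep])

x = np.random.rand(2, 3, 4)
y = np.random.rand(3, 4, 5)
assert np.allclose(tensordot_via_dot(x, y, ((1, 2), (0, 1))),
                   np.tensordot(x, y, axes=((1, 2), (0, 1))))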

In [1]: import numpy as np, random, sparse

In [2]: def random_x(shape, dtype=float):
   ...:     x = np.zeros(shape=shape, dtype=dtype)
   ...:     for i in range(max(5, np.prod(x.shape) // 3)):
   ...:         x[tuple(random.randint(0, d - 1) for d in x.shape)] = random.randint(0, 100)
   ...:     return x
   ...: 

In [3]: x = random_x((2, 3, 4))

In [4]: y = random_x((3, 4, 5))

In [5]: x
Out[5]: 
array([[[  0.,   0.,   0.,   0.],
        [ 34.,  63.,   0.,   0.],
        [ 79.,   0.,  13.,   0.]],

       [[  0.,  77.,  93.,   0.],
        [ 79.,   0.,   0.,   0.],
        [  0.,  62.,   0.,   0.]]])

In [6]: y
Out[6]: 
array([[[  0.,  52.,   0.,   0.,   0.],
        [ 28.,  36.,   0.,   0.,   0.],
        [  0.,   0.,   0.,  13.,   0.],
        [  0.,   0.,   6.,  25.,  64.]],

       [[  0.,   0.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   0.,  24.],
        [  0.,   0.,   0.,   0.,   6.],
        [  0.,   0.,   0.,   0.,  17.]],

       [[  0.,  46.,   0.,  59.,   0.],
        [  0.,  83.,   0.,  62.,  23.],
        [ 94.,   0.,   0.,  25.,   0.],
        [  0.,   0.,  62.,   0.,   0.]]])

In [7]: xx = COO.from_numpy(x)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-0fae8213abde> in <module>()
----> 1 xx = COO.from_numpy(x)

NameError: name 'COO' is not defined

In [8]: xx = sparse.COO.from_numpy(x)

In [9]: yy = sparse.COO.from_numpy(y)

In [10]: xx
Out[10]: <COO: shape=(2, 3, 4), dtype=float64, nnz=8>

In [11]: yy
Out[11]: <COO: shape=(3, 4, 5), dtype=float64, nnz=18>

In [12]: sparse.tensordot(xx, yy, axes=((1, 2), (0, 1)))
Out[12]: <COO: shape=(2, 5), dtype=float64, nnz=8>

In [13]: np.array(_)
Out[13]: 
array([[ 1222.,  3634.,     0.,  4986.,  1512.],
       [ 2156.,  7918.,     0.,  5053.,  1426.]])

In [14]: xx.sum(axis=(0, 2))
Out[14]: <COO: shape=(3,), dtype=float64, nnz=3>

In [15]: np.array(_)
Out[15]: array([ 170.,  176.,  154.])

In [16]: xx.sum(axis=(0, 2), keepdims=True)
Out[16]: <COO: shape=(1, 3, 1), dtype=float64, nnz=3>

In [17]: np.array(_)
Out[17]: 
array([[[ 170.],
        [ 176.],
        [ 154.]]])

I did most of what I perceive to be the challenging things. However, many simple things still don’t work:

  1. Arithmetic
  2. Elemwise ufuncs (this should be easy for someone who understands the protocol here; see the xfailed tests, and the zero-preserving sketch after this list)
  3. Tensordot/binary operators with other arrays (we need some binary-dispatch solution)
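For the common case where the operation maps zero to zero, an elementwise ufunc on a COO-style array only needs to touch the stored values. Here is a rough sketch of that zero-preserving case, with a hypothetical helper and scipy.sparse.coo_matrix standing in for the multidimensional container:

import numpy as np
import scipy.sparse

def elemwise_unary(func, x):
    # Only valid when func(0) == 0, so implicit zeros stay zero and only the
    # stored entries need to be computed.
    if func(0) != 0:
        raise ValueError("func must map 0 -> 0 to keep the result sparse")
    out = x.copy()
    out.data = func(out.data)
    return out

x = scipy.sparse.random(4, 5, density=0.3, format='coo')
assert np.allclose(elemwise_unary(np.sin, x).toarray(), np.sin(x.toarray()))

Ufuncs that do not preserve zero (np.exp, for example) would densify the result and need different handling.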

Review of https://github.com/mrocklin/sparse would be appreciated.

0 reactions
perimosocordiae commented, Apr 18, 2017

This makes three sparse ndarray-like libraries (that I know of):

  • my sparray: linearized+sorted coordinate format (sketched below)
  • @ev-br’s sparr: C++ map backend (similar to scipy’s DOK format)
  • @mrocklin’s sparse: separate coordinate format

All three attempt to follow the numpy array API where possible, with varying degrees of coverage. Perhaps we should try to pull them into one package, to avoid duplicated effort?
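For reference, here is a rough sketch of what a linearized-and-sorted coordinate layout looks like (an illustrative class, not sparray's actual internals): N-d indices are flattened to one integer key per stored entry and kept sorted, so point lookup becomes a binary search.

import numpy as np

class LinearizedCOO:
    def __init__(self, coords, data, shape):
        # Flatten N-d coordinates to 1-d keys (C order) and sort by key.
        flat = np.ravel_multi_index(coords, shape)
        order = np.argsort(flat)
        self.flat = flat[order]
        self.data = np.asarray(data)[order]
        self.shape = shape

    def __getitem__(self, idx):
        key = np.ravel_multi_index(idx, self.shape)
        pos = np.searchsorted(self.flat, key)
        if pos < len(self.flat) and self.flat[pos] == key:
            return self.data[pos]
        return 0.0

a = LinearizedCOO(coords=([0, 1, 1], [2, 0, 3]), data=[5.0, 7.0, 9.0], shape=(2, 4))
assert a[1, 0] == 7.0 and a[0, 0] == 0.0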

