Combining dask with sparse arrays
Hi @hameerabbasi,
This is more of a discussion than an issue, so if this isn't the right place for it, let me know — I thought it might still be valuable to raise. Recently, I've been thinking about how to do distributed skeleton analysis with skan and Dask. Skan uses the sparse library under the hood.
Some things we considered that don’t work so well:
- First, we thought about processing the input data bit by bit and combining many sparse arrays into a bigger one at the end. But this didn't seem like a good idea: `indptr` scales linearly with the size of the largest index, regardless of how few non-zero values the array contains.
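To make the `indptr` point concrete, here's a quick illustration (using SciPy's CSR format as a stand-in; the same compressed-index behaviour applies to any CSR-like layout): the length of `indptr` tracks the number of rows, not the number of non-zeros, so a huge-but-nearly-empty array still pays for a huge `indptr`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two non-zeros in a tiny matrix...
small = csr_matrix((np.ones(2), ([0, 1], [0, 1])), shape=(4, 4))
# ...and the same two non-zeros in a huge matrix.
big = csr_matrix((np.ones(2), ([0, 1], [0, 1])), shape=(10**6, 10**6))

# indptr has one entry per row plus one, regardless of nnz.
print(len(small.indptr))  # 5
print(len(big.indptr))    # 1000001
```

So combining many per-chunk sparse arrays into one with a very large index range inflates `indptr` even when almost nothing is stored.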
- Next, we thought about passing output from dask directly into the construction of the sparse array (e.g. `sparse.coo_matrix((data, (row, col)))`). However, this isn't currently possible. At line 148 of `/sparse/coo.py`, in the `__init__` function, the `operator.index` function doesn't know how to handle dask arrays (that's fair enough).
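A minimal, dask-free sketch of that failure mode (`LazyScalar` is a hypothetical stand-in for a lazy dask scalar, not a real dask class): `operator.index` only accepts objects that implement `__index__`, i.e. true integers, which lazy array objects don't.

```python
import operator
import numpy as np

# operator.index works on anything implementing __index__ (real integers):
assert operator.index(np.int64(5)) == 5

# A stand-in for a lazy dask scalar: no __index__, so operator.index rejects it.
class LazyScalar:
    pass

try:
    operator.index(LazyScalar())
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```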
What we have that kinda works:
- Currently we rely on the (faulty) assumption that the data will be sparse/small enough that we can bring the results of the dask computation into memory as a numpy array, and then create the sparse array from that. If you're curious, you can take a look at this gist (scroll right to the end for the sparse array construction).
Do you happen to have any advice? Is this type of workflow something that fits with the overall project goals of sparse?

Ok. In the case of an image skeleton to graph adjacency matrix, it might complicate things slightly. Results from a single chunk in the image need to be spread out over more than one chunk in the adjacency matrix. Possibly we'd want to collect the results from all chunks, then re-order them (dask really doesn't do `argsort`, though) and then put them into the graph adjacency matrix.

This has been a good discussion, thanks!
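A rough numpy-only sketch of that collect-then-reorder idea (the per-chunk results below are made up for illustration): gather the coordinate arrays from every chunk, then sort them globally in memory with numpy's `argsort`, since dask doesn't offer one.

```python
import numpy as np

# Hypothetical per-chunk results: each image chunk yields edge coordinates
# that may land anywhere in the global adjacency matrix.
chunk_results = [
    (np.array([5, 0]), np.array([1, 2]), np.array([1.0, 1.0])),
    (np.array([2, 7]), np.array([3, 0]), np.array([1.0, 1.0])),
]

# Collect results from all chunks...
rows = np.concatenate([r for r, _, _ in chunk_results])
cols = np.concatenate([c for _, c, _ in chunk_results])
vals = np.concatenate([v for _, _, v in chunk_results])

# ...then re-order globally in memory (dask has no argsort).
order = np.argsort(rows, kind="stable")
rows, cols, vals = rows[order], cols[order], vals[order]
print(rows)  # [0 2 5 7]
```

The sorted coordinate arrays could then be handed to the sparse array constructor as in the workaround above.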
Honestly, this is the first time someone confused the two.