Minimal example with sparse arrays
I was wondering if there is a minimal example of using sparse arrays with Dask, along the lines of:
```python
import dask.array as da

# x is a 2D sparse array; chunk it into 100x100 blocks
y = da.from_array(x, chunks=(100, 100))
y.sum().compute()
```
where x is a 2D sparse array saved on disk and then partially loaded by some package that supports NumPy slicing syntax. I am aware of the Sparse arrays section in the documentation; however, I'm not certain how the proposed approach of swapping out in-memory NumPy arrays for in-memory sparse arrays could apply to sparse arrays saved on disk.
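For reference, the in-memory approach from the documentation looks roughly like this, using the pydata/sparse package (the array size and the 0.95 threshold are arbitrary):

```python
import dask.array as da
import sparse

# Dense random array with ~95% zeros, split into 1000x1000 blocks
x = da.random.random((10000, 10000), chunks=(1000, 1000))
x[x < 0.95] = 0

# Convert each in-memory NumPy block into an in-memory sparse.COO block
s = x.map_blocks(sparse.COO)
s.sum().compute()
```

This works because every block already lives in memory; it does not address the on-disk case I'm asking about.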
I posted a question on SO about this earlier, without receiving any suggestions on how it could be achieved. There I focused on scipy CSR matrices loaded with joblib.load(..., mmap_mode="r"), because this allows partial loading of sparse arrays from disk and supports NumPy slicing. However, I'm open to any other suggestions regarding the sparse array format / IO package (e.g. using the mrocklin/sparse package).
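For illustration, a minimal sketch of that joblib pattern (the file name x.joblib and the matrix shape are made up):

```python
import scipy.sparse as sp
import joblib

# Save a CSR matrix to disk (hypothetical file name)
x = sp.random(10000, 1000, density=0.01, format="csr")
joblib.dump(x, "x.joblib")

# mmap_mode="r" memory-maps the underlying data/indices/indptr arrays,
# so slicing rows only reads the parts of the file that are needed
x_mm = joblib.load("x.joblib", mmap_mode="r")
rows = x_mm[100:200]  # a scipy CSR slice backed by memory-mapped buffers
```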
I find the suggestions in the documentation about storage formats very useful (e.g. Parquet for dataframes and, more loosely, HDF5 for dense arrays). Is there any consensus yet as to what format could be used (possibly in the future) to store sparse arrays with Dask? I'm not sure whether anyone has worked on a language-independent sparse array format (akin to Feather or Parquet for dataframes), or whether it's possible to read/write the distributed sparse format from Spark in Python without installing PySpark…
If this is not yet possible in Dask, is there a development roadmap for getting there? In particular, there have been a number of issues on this, but at present all of them are closed.
Thank you. Sorry for all the questions.
Top GitHub Comments
Two changes:

- Pass the asarray=False keyword argument to from_array to stop coercion to NumPy arrays

Ideally you would load your data lazily as tasks, perhaps using something like dask.delayed and da.from_delayed; see the sketch after this comment. This blogpost might be helpful: https://blog.dask.org/2019/06/20/load-image-data
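A minimal sketch of that lazy-loading pattern, assuming a hypothetical load_rows(i, j) helper that reads only rows i:j of the on-disk matrix (for example via the memory-mapped joblib load shown earlier), plus a made-up file name and shape:

```python
import dask
import dask.array as da
import joblib
import numpy as np
import sparse

def load_rows(i, j):
    # Hypothetical loader: memory-map the file, read only rows i:j,
    # and convert the scipy CSR slice to a pydata/sparse COO block
    x_mm = joblib.load("x.joblib", mmap_mode="r")
    return sparse.COO.from_scipy_sparse(x_mm[i:j])

nrows, ncols = 10000, 1000  # assumed overall shape of the on-disk matrix
chunk = 1000

blocks = [
    da.from_delayed(
        dask.delayed(load_rows)(i, i + chunk),
        shape=(chunk, ncols),
        dtype=np.float64,
        meta=sparse.COO.from_numpy(np.empty((0, 0))),  # blocks are sparse, not numpy
    )
    for i in range(0, nrows, chunk)
]
y = da.concatenate(blocks, axis=0)
y.sum().compute()
```

Because each load_rows call runs as its own task, the full matrix never has to sit in memory at once.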
If you absolutely need to load the data locally, then you might consider using scatter and then da.from_delayed, as sketched below.
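A sketch of that variant, assuming a dask.distributed client and a sparse array s already loaded in the local process; the future returned by scatter is passed to da.from_delayed in place of a delayed value:

```python
import dask.array as da
import numpy as np
import sparse
from dask.distributed import Client

client = Client()  # connect to (or start) a distributed scheduler

# s: a sparse array already sitting in local memory (toy example)
s = sparse.COO.from_numpy(np.eye(1000))

# Ship it to the cluster once, then wrap the resulting future as a dask array
future = client.scatter(s)
y = da.from_delayed(future, shape=s.shape, dtype=s.dtype,
                    meta=sparse.COO.from_numpy(np.empty((0, 0))))
y.sum().compute()
```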