Minimal example with sparse arrays
I was wondering if there is a minimal example of using sparse arrays with Dask, along the lines of:
```python
import dask.array as da

# x is a 2D sparse array; chunk it into 100x100 blocks
y = da.from_array(x, chunks=(100, 100))
y.sum().compute()
```
where x is a 2D sparse array saved on disk and then partially loaded by some package that supports NumPy slicing syntax. I am aware of the Sparse arrays section in the documentation; however, I'm not certain how the proposed approach of swapping out in-memory NumPy arrays for in-memory sparse arrays could apply to sparse arrays saved on disk.
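For reference, the in-memory approach from the documentation looks roughly like this, using the pydata/sparse package (the array size and the 0.95 threshold are arbitrary):

```python
import dask.array as da
import sparse

# Dense random array with ~95% zeros, split into 1000x1000 blocks
x = da.random.random((10000, 10000), chunks=(1000, 1000))
x[x < 0.95] = 0

# Convert each in-memory NumPy block into an in-memory sparse.COO block
s = x.map_blocks(sparse.COO)
s.sum().compute()
```

This works because every block already lives in memory; it does not address the on-disk case I'm asking about.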
I posted a question on SO about this earlier, without receiving any suggestions on how it could be achieved. There I focused on scipy CSR matrices loaded with joblib.load(..., mmap_mode="r"), because this allows partial loading of sparse arrays from disk and supports NumPy slicing. However, I'm open to any other suggestions regarding the sparse array format / IO package (e.g. using the mrocklin/sparse package).
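For illustration, a minimal sketch of that joblib pattern (the file name x.joblib and the matrix shape are made up):

```python
import scipy.sparse as sp
import joblib

# Save a CSR matrix to disk (hypothetical file name)
x = sp.random(10000, 1000, density=0.01, format="csr")
joblib.dump(x, "x.joblib")

# mmap_mode="r" memory-maps the underlying data/indices/indptr arrays,
# so slicing rows only reads the parts of the file that are needed
x_mm = joblib.load("x.joblib", mmap_mode="r")
rows = x_mm[100:200]  # a scipy CSR slice backed by memory-mapped buffers
```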
I find the suggestions in the documentation about storage formats very useful (e.g. Parquet for dataframes and, more loosely, HDF5 for dense arrays). Is there any consensus yet as to what format could be used (possibly in the future) to store sparse arrays with Dask? I'm not sure whether anyone has worked on a language-independent sparse array format (akin to Feather or Parquet for dataframes), or whether it's possible to read/write the distributed sparse format from Spark in Python without installing PySpark…
If this is not yet possible in Dask, is there a development roadmap for getting there? In particular, there have been a number of issues on this, but at present all of them are closed.
Thank you. Sorry for all the questions.
Top GitHub Comments
Two changes:

- Pass the asarray=False keyword argument to from_array to stop coercion to NumPy arrays

Ideally you would load your data lazily as tasks, perhaps using something like dask.delayed and da.from_delayed; see the sketch after this comment. This blogpost might be helpful: https://blog.dask.org/2019/06/20/load-image-data
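A minimal sketch of that lazy-loading pattern, assuming a hypothetical load_rows(i, j) helper that reads only rows i:j of the on-disk matrix (for example via the memory-mapped joblib load shown earlier), plus a made-up file name and shape:

```python
import dask
import dask.array as da
import joblib
import numpy as np
import sparse

def load_rows(i, j):
    # Hypothetical loader: memory-map the file, read only rows i:j,
    # and convert the scipy CSR slice to a pydata/sparse COO block
    x_mm = joblib.load("x.joblib", mmap_mode="r")
    return sparse.COO.from_scipy_sparse(x_mm[i:j])

nrows, ncols = 10000, 1000  # assumed overall shape of the on-disk matrix
chunk = 1000

blocks = [
    da.from_delayed(
        dask.delayed(load_rows)(i, i + chunk),
        shape=(chunk, ncols),
        dtype=np.float64,
        meta=sparse.COO.from_numpy(np.empty((0, 0))),  # blocks are sparse, not numpy
    )
    for i in range(0, nrows, chunk)
]
y = da.concatenate(blocks, axis=0)
y.sum().compute()
```

Because each load_rows call runs as its own task, the full matrix never has to sit in memory at once.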
If you absolutely need to load the data locally, then you might consider using scatter and then da.from_delayed, as sketched below.
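A sketch of that variant, assuming a dask.distributed client and a sparse array s already loaded in the local process; the future returned by scatter is passed to da.from_delayed in place of a delayed value:

```python
import dask.array as da
import numpy as np
import sparse
from dask.distributed import Client

client = Client()  # connect to (or start) a distributed scheduler

# s: a sparse array already sitting in local memory (toy example)
s = sparse.COO.from_numpy(np.eye(1000))

# Ship it to the cluster once, then wrap the resulting future as a dask array
future = client.scatter(s)
y = da.from_delayed(future, shape=s.shape, dtype=s.dtype,
                    meta=sparse.COO.from_numpy(np.empty((0, 0))))
y.sum().compute()
```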