
Minimal example with sparse arrays


I was wondering if there is a minimal example of using sparse arrays with Dask, along the lines of,

import dask.array as da
y = da.from_array(x, chunks=(100))
y.sum().compute()

where x is a 2D sparse array saved on disk, then partially loaded by some package that supports Numpy slicing syntax. I am aware of the Sparse arrays section in the documentation; however, I'm not certain how the proposed approach of swapping in-memory numpy arrays for in-memory sparse arrays would apply to sparse arrays saved on disk.

I posted a question on Stack Overflow about this earlier, without receiving any suggestions on how it could be achieved. There I focused on scipy CSR arrays loaded with joblib.load(.. , mmap_mode="r"), because this allows partial loading of sparse arrays from disk and supports numpy slicing. However, I'm open to any other suggestions regarding the sparse array format / IO package (e.g. using the mrocklin/sparse package).

I find the suggestions in the documentation about the storage format very useful (e.g. parquet for dataframes, more loosely hdf5 for dense arrays, etc). Is there any consensus yet as to what format could be used (possibly in the future) to store sparse arrays with dask? I’m not sure if anyone has worked on a language independent sparse array format (akin to feather or parquet for dataframes), or if it’s possible to read/write the distributed sparse format from Spark in Python without installing PySpark…

If this is not yet possible in dask, is there a development roadmap for getting there? In particular, there were a number of issues on this, but at present, all of them are closed.

Thank you. Sorry for all the questions.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

mrocklin commented, Jul 27, 2017 (3 reactions)

Two changes:

  1. chunks is short for chunk shape. You'll want to include the full per-dimension shape of the chunks you wish to cut out.
  2. Use the asarray=False keyword argument to from_array to stop coercion to numpy arrays.
In [1]: import dask

In [2]: import dask.array as da

In [3]: import sparse

In [4]: x = sparse.COO({(10000, 10000): 1})

In [5]: x = da.from_array(x, chunks=(1000, 1000), asarray=False)

In [6]: x.compute()
Out[6]: <COO: shape=(10001, 10001), dtype=int64, nnz=1, sorted=False, duplicates=False>
mrocklin commented, Jul 25, 2019 (0 reactions)

Ideally you would load your data lazily as tasks, perhaps using something like dask.delayed and da.from_delayed. This blogpost might be helpful: https://blog.dask.org/2019/06/20/load-image-data

If you absolutely need to load data locally then you might consider using scatter and then da.from_delayed.
