Sparse arrays support
While sparse arrays are supported in dask, this issue aims to open the discussion on how this could be applied in the context of dask-ml.
In particular, even if #5 about TF-IDF gets resolved, the estimators downstream in the pipeline would also need to support sparse arrays for this to be of any use. The simplest example of such a pipeline could for instance be #115: a text vectorizer feeding a wrapped out-of-core scikit-learn model (e.g. PartialMultinomialNB).
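As a rough illustration of that kind of pipeline (a sketch only: PartialMultinomialNB is the wrapped incremental estimator referenced above, and plain scikit-learn MultinomialNB with partial_fit stands in for it here), a stateless hashing vectorizer can feed sparse CSR batches to an out-of-core model:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: no fitted vocabulary, so it works batch by batch.
# alternate_sign=False keeps features non-negative for MultinomialNB.
vec = HashingVectorizer(n_features=2 ** 10, alternate_sign=False)
clf = MultinomialNB()

# Toy batches standing in for partitions of a larger text corpus.
batches = [(["spam spam eggs", "ham and eggs"], [1, 0]),
           (["more spam", "just ham"], [1, 0])]
for texts, y in batches:
    X = vec.transform(texts)              # scipy.sparse CSR output
    clf.partial_fit(X, y, classes=[0, 1])  # incremental, out-of-core fit

pred = clf.predict(vec.transform(["spam spam"]))
```

This is exactly the situation where every downstream step must accept sparse input: `transform` here produces CSR matrices throughout.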
Potentially relevant estimators
TruncatedSVD, text vectorizers, some estimators in dask_ml.preprocessing, and wrapped scikit-learn models that support incremental learning and sparse arrays natively.
Sparse array format
There are several choices here:
- should mrocklin/sparse be added as a hard dependency, as the package that works out of the box for sparse arrays?
- should sparse arrays from scipy be wrapped to make them compatible in some limited fashion (if that is possible at all)?
In particular, as far as I understand, the application at hand has no need for the N-dimensional sparse COO arrays provided by the sparse package; 2D would be enough. Furthermore, scikit-learn mostly uses CSR, and while it's relatively easy to convert between COO and CSR/CSC in the non-distributed case, I'm not sure whether that still holds for dask. Then there is the partitioning strategy (see next section).
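For reference, the non-distributed conversion mentioned above really is cheap: scipy converts between COO and CSR in O(nnz) time, in-memory:

```python
import numpy as np
from scipy import sparse as sp

# Build a small 2D COO matrix from (data, (row, col)) triplets and
# convert to CSR -- the cheap, non-distributed case discussed above.
coo = sp.coo_matrix(
    (np.array([1.0, 2.0, 3.0]),
     (np.array([0, 1, 2]), np.array([1, 0, 2]))),
    shape=(3, 3),
)
csr = coo.tocsr()  # O(nnz) format conversion
```

The open question is whether the same conversion stays cheap (or is even expressible) when the array is chunked across a dask graph.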
Partitioning strategy
At least as far as text vectorizers and incremental learning estimators are concerned, I imagine it might be easier to partition the arrays row-wise (each partition spanning the full column width), which would also be natural with the CSR format.
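Row-wise partitioning pairs naturally with CSR because row slicing of a CSR matrix is cheap. A minimal sketch of the strategy (illustrative only; this is not dask's actual chunking code, and `row_chunks` is a hypothetical helper):

```python
from scipy import sparse as sp

def row_chunks(X, chunk_size):
    """Yield row-wise CSR chunks, each spanning all columns."""
    for start in range(0, X.shape[0], chunk_size):
        # CSR row slicing is cheap: it only copies the relevant
        # slice of indptr/indices/data.
        yield X[start:start + chunk_size]

# 10x5 random sparse matrix split into row partitions of up to 4 rows.
X = sp.random(10, 5, density=0.3, format="csr", random_state=0)
chunks = list(row_chunks(X, 4))
```

Each chunk is itself a valid CSR matrix, which is what an incremental estimator's `partial_fit` would consume one batch at a time.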
File storage format
For instance, once someone manages to compute a distributed TF-IDF, the question arises how to store it on disk without loading everything in memory at once. At present, there doesn't appear to be a canonical way to do this (https://github.com/dask/dask/issues/2562#issuecomment-318339397). https://github.com/zarr-developers/zarr/issues/152 might be relevant, but as far as I understand it essentially stores the dense format with compression, which I believe makes later computation with the data difficult.
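One ad-hoc option in the absence of a canonical format (an assumption on my part, not an established dask convention) is to serialize each row-wise partition separately with `scipy.sparse.save_npz`, so no single process ever needs the full matrix:

```python
import io
from scipy import sparse as sp

X = sp.random(6, 4, density=0.5, format="csr", random_state=0)

# Write the first row partition to its own .npz payload (a BytesIO
# stands in for a per-partition file on disk or object storage).
buf = io.BytesIO()
sp.save_npz(buf, X[:3])

# Later, any worker can load just that partition back.
buf.seek(0)
part = sp.load_npz(buf)
```

The obvious downside is that this is a pile of loose files rather than a self-describing chunked store like zarr, so metadata (global shape, chunk boundaries) would have to be tracked separately.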
Just a few general thoughts. I'm not sure what your vision of the project is in this respect, @mrocklin @TomAugspurger, how much work this would represent, or what might be easiest to start with…
Issue Analytics
- Created 6 years ago
- Comments: 24 (14 by maintainers)
Top GitHub Comments
Long term it would be nice to see a scipy.sparse CSR class that satisfies the np.ndarray rather than the np.matrix interface. Then it could operate nicely with dask array while still having the CSR layout. This is discussed in more depth in https://github.com/scipy/scipy/issues/8162
To be clear, I'm not suggesting a fully n-dimensional implementation, just the current implementation where n <= 2, but following ndarray conventions. This is a broader issue for the rest of the community. My understanding is that sklearn devs would also appreciate a scipy.sparse.csr_array class.

This is feasible. The only concern would be that we do a lot of work and never reach the functionality of scipy.sparse. It may be that copying the entire implementation over and then tweaking a couple of things to stop using the np.matrix interface is significantly easier.
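For what it's worth, scipy has since gained exactly such a class: `scipy.sparse.csr_array` (scipy >= 1.8, an assumption about the reader's installed version) follows ndarray rather than np.matrix semantics, so `*` is elementwise and `@` is matrix multiplication:

```python
import numpy as np
from scipy import sparse as sp

# csr_array keeps the CSR layout but uses ndarray conventions
# (requires scipy >= 1.8).
A = sp.csr_array(np.array([[1.0, 0.0], [0.0, 2.0]]))

elementwise = A * A  # elementwise product, NOT matmul (matrix semantics)
matmul = A @ A       # matrix multiplication goes through @
```

This is the behavioral difference that matters for interoperating with dask array, which assumes ndarray semantics for its chunk type.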
To be clear, people can do whatever they want to do. I have thoughts on how I would or would not spend my time here, but that doesn’t stop others from exploring this topic. I doubt that my thoughts on this topic are entirely correct.