Sparse arrays support
While sparse arrays are supported in dask, this issue aims to open the discussion on how this could be applied in the context of dask-ml.
In particular, even if #5 about TF-IDF gets resolved, the estimators downstream in the pipeline would also need to support sparse arrays for this to be of any use. The simplest example of such a pipeline could for instance be #115: a text vectorizer feeding a wrapped out-of-core scikit-learn model (e.g. PartialMultinomialNB).
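As a rough illustration of that kind of pipeline (a sketch only: PartialMultinomialNB is the wrapped incremental estimator referenced above, and plain scikit-learn MultinomialNB with partial_fit stands in for it here), a stateless hashing vectorizer can feed sparse CSR batches to an out-of-core model:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: no fitted vocabulary, so it works batch by batch.
# alternate_sign=False keeps features non-negative for MultinomialNB.
vec = HashingVectorizer(n_features=2 ** 10, alternate_sign=False)
clf = MultinomialNB()

# Toy batches standing in for partitions of a larger text corpus.
batches = [(["spam spam eggs", "ham and eggs"], [1, 0]),
           (["more spam", "just ham"], [1, 0])]
for texts, y in batches:
    X = vec.transform(texts)              # scipy.sparse CSR output
    clf.partial_fit(X, y, classes=[0, 1])  # incremental, out-of-core fit

pred = clf.predict(vec.transform(["spam spam"]))
```

This is exactly the situation where every downstream step must accept sparse input: `transform` here produces CSR matrices throughout.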
Potentially relevant estimators
TruncatedSVD, text vectorizers, some estimators in dask_ml.preprocessing, and wrapped scikit-learn models that support incremental learning and sparse arrays natively.
Sparse array format
There are several choices here:
- should mrocklin/sparse be added as a hard dependency, as the package that works out of the box for sparse arrays?
- should sparse arrays from scipy be wrapped to make them compatible in some limited fashion (if that is possible at all)?
In particular, as far as I understand, the application at hand has no need for the N-dimensional sparse COO arrays provided by the sparse package; 2D would be enough. Furthermore, scikit-learn mostly uses CSR, and while it's relatively easy to convert between COO and CSR/CSC in the non-distributed case, I'm not sure whether that still holds for dask. Then there is the partitioning strategy (see next section).
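For reference, the non-distributed conversion mentioned above really is cheap: scipy converts between COO and CSR in O(nnz) time, in-memory:

```python
import numpy as np
from scipy import sparse as sp

# Build a small 2D COO matrix from (data, (row, col)) triplets and
# convert to CSR -- the cheap, non-distributed case discussed above.
coo = sp.coo_matrix(
    (np.array([1.0, 2.0, 3.0]),
     (np.array([0, 1, 2]), np.array([1, 0, 2]))),
    shape=(3, 3),
)
csr = coo.tocsr()  # O(nnz) format conversion
```

The open question is whether the same conversion stays cheap (or is even expressible) when the array is chunked across a dask graph.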
Partitioning strategy
At least as far as text vectorizers and incremental learning estimators are concerned, I imagine it might be easier to partition the arrays row-wise (each partition spanning the full column width), which would also be natural with the CSR format.
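Row-wise partitioning pairs naturally with CSR because row slicing of a CSR matrix is cheap. A minimal sketch of the strategy (illustrative only; this is not dask's actual chunking code, and `row_chunks` is a hypothetical helper):

```python
from scipy import sparse as sp

def row_chunks(X, chunk_size):
    """Yield row-wise CSR chunks, each spanning all columns."""
    for start in range(0, X.shape[0], chunk_size):
        # CSR row slicing is cheap: it only copies the relevant
        # slice of indptr/indices/data.
        yield X[start:start + chunk_size]

# 10x5 random sparse matrix split into row partitions of up to 4 rows.
X = sp.random(10, 5, density=0.3, format="csr", random_state=0)
chunks = list(row_chunks(X, 4))
```

Each chunk is itself a valid CSR matrix, which is what an incremental estimator's `partial_fit` would consume one batch at a time.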
File storage format
For instance, once someone manages to compute a distributed TF-IDF, the question arises how to store it on disk without loading everything in memory at once. At present, there doesn't appear to be a canonical way to do this (https://github.com/dask/dask/issues/2562#issuecomment-318339397). https://github.com/zarr-developers/zarr/issues/152 might be relevant, but as far as I understand it essentially stores the dense format with compression, which I believe makes later computation with the data difficult.
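One ad-hoc option in the absence of a canonical format (an assumption on my part, not an established dask convention) is to serialize each row-wise partition separately with `scipy.sparse.save_npz`, so no single process ever needs the full matrix:

```python
import io
from scipy import sparse as sp

X = sp.random(6, 4, density=0.5, format="csr", random_state=0)

# Write the first row partition to its own .npz payload (a BytesIO
# stands in for a per-partition file on disk or object storage).
buf = io.BytesIO()
sp.save_npz(buf, X[:3])

# Later, any worker can load just that partition back.
buf.seek(0)
part = sp.load_npz(buf)
```

The obvious downside is that this is a pile of loose files rather than a self-describing chunked store like zarr, so metadata (global shape, chunk boundaries) would have to be tracked separately.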
Just a few general thoughts. I'm not sure what your vision of the project is in this respect, @mrocklin @TomAugspurger, how much work this would represent, or what might be easiest to start with…
Issue Analytics
- Created 6 years ago
- Comments: 24 (14 by maintainers)
Top GitHub Comments
Long term it would be nice to see a scipy.sparse CSR class that satisfies the np.ndarray rather than the np.matrix interface. Then it could operate nicely with dask array while still having the CSR layout. This is discussed in more depth in https://github.com/scipy/scipy/issues/8162
To be clear, I'm not suggesting a fully n-dimensional implementation, just the current implementation where n <= 2, but following ndarray conventions. This is a broader issue for the rest of the community. My understanding is that sklearn devs would also appreciate a scipy.sparse.csr_array class.

This is feasible. The only concern would be that we do a lot of work and never reach the functionality of scipy.sparse. It may be that copying the entire implementation over and then tweaking a couple of things to stop using the np.matrix interface is significantly easier.
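For what it's worth, scipy has since gained exactly such a class: `scipy.sparse.csr_array` (scipy >= 1.8, an assumption about the reader's installed version) follows ndarray rather than np.matrix semantics, so `*` is elementwise and `@` is matrix multiplication:

```python
import numpy as np
from scipy import sparse as sp

# csr_array keeps the CSR layout but uses ndarray conventions
# (requires scipy >= 1.8).
A = sp.csr_array(np.array([[1.0, 0.0], [0.0, 2.0]]))

elementwise = A * A  # elementwise product, NOT matmul (matrix semantics)
matmul = A @ A       # matrix multiplication goes through @
```

This is the behavioral difference that matters for interoperating with dask array, which assumes ndarray semantics for its chunk type.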
To be clear, people can do whatever they want to do. I have thoughts on how I would or would not spend my time here, but that doesn’t stop others from exploring this topic. I doubt that my thoughts on this topic are entirely correct.