Does randomized TruncatedSVD really work on big, non-skinny datasets?
I am trying to compute a TruncatedSVD on data that is too big to fit in memory. The original sklearn.decomposition
implementation does not work in this case, so I have been trying to use the dask-ml version (which is based on the dask.array.linalg.svd_compressed
algorithm).
Here is a minimal example that uses data shaped like mine and gets killed for using too much RAM; depending on your locally available RAM, the dimensions of the dataset might have to be adapted.
import numpy as np
import dask.array as da  ## version 2.9.2
import dask_ml.decomposition  ## version 1.2.0

# Create random data
bigData = da.random.random((70000, 500000)).astype(np.float32).rechunk('auto')

# Fit the SVD
svd = dask_ml.decomposition.TruncatedSVD(n_components=4000, algorithm="randomized")
svd.fit(bigData)
smallData = svd.transform(bigData)
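For context, a back-of-envelope size estimate (my numbers, not from the issue; the oversampling of 10 is an assumed default) shows why this run gets killed:

```python
rows, cols = 70_000, 500_000
n_components = 4_000
bytes_per_float32 = 4

# Size of the input array alone, before any SVD work happens.
data_gb = rows * cols * bytes_per_float32 / 1e9
print(f"input array: {data_gb:.0f} GB")  # input array: 140 GB

# The randomized sketch needs a cols x (k + oversampling) factor on top.
sketch_gb = cols * (n_components + 10) * bytes_per_float32 / 1e9
print(f"sketch factor: {sketch_gb:.2f} GB")  # sketch factor: 8.02 GB
```

So the input alone is roughly 140 GB in float32, well beyond typical workstation RAM, even before the randomized algorithm's intermediates are counted.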
Maybe this is related to #401, or I just don't understand dask's intended use cases? Or maybe I am chunking my data wrongly?
Thanks in advance for any help!
Issue Analytics
- State:
- Created 4 years ago
- Comments: 7 (5 by maintainers)
The compressed solution can work on non-skinny datasets, yes. If your data does not fit in memory, I recommend using the compute=True option. You will also probably need to be mindful of how the data is chunked so that the matrix multiplies fit nicely in small memory (this is hard). I started writing a blog post on this topic here but never finished: https://github.com/dask/dask-blog/pull/38/files
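A rough sketch of that advice (a scaled-down stand-in for the 70000 x 500000 array; the shape, chunk sizes, and k here are illustrative, not tuned values):

```python
import numpy as np
import dask.array as da

# Scaled-down stand-in for the big array; chunks are chosen so each
# block is a few MB, which keeps the intermediate matmuls small.
X = da.random.random((2000, 10000), chunks=(500, 2500)).astype(np.float32)

# compute=True evaluates intermediate results eagerly inside the
# algorithm, lowering the peak memory of the randomized SVD.
u, s, v = da.linalg.svd_compressed(X, k=20, compute=True)

u, s, v = da.compute(u, s, v)
print(u.shape, s.shape, v.shape)  # (2000, 20) (20,) (20, 10000)
```

The same call shape applies to the full-size array; the hard part, as noted above, is picking chunks so that no single intermediate product exceeds available memory.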
On Sat, Jan 25, 2020 at 4:21 AM Tom Augspurger notifications@github.com wrote:
(I’m glad that things are working for you)
On Wed, Jan 29, 2020 at 9:45 AM Matthew Rocklin mrocklin@gmail.com wrote: