
Does randomized TruncatedSVD really work on big, non-skinny datasets?

See the original GitHub issue: https://github.com/dask/dask-ml/issues/604

I am trying to compute a truncated SVD on data that is too big to fit in memory. The original sklearn.decomposition implementation does not work here in any case, so I have been trying the dask-ml version (which is based on the dask.array.linalg.svd_compressed algorithm).

Here is a minimal example that uses data similar to mine and gets killed because it uses too much RAM; depending on your locally available RAM, the dimensions of the dataset may need to be adapted.

import dask_ml.decomposition  # version 1.2.0
import dask.array as da       # version 2.9.2
import numpy as np

# Create random data: 70,000 x 500,000 float32 (~140 GB), rechunked automatically
bigData = da.random.random((70000, 500000)).astype(np.float32).rechunk('auto')

# Fit a 4,000-component SVD with the "randomized" algorithm, then transform
svd = dask_ml.decomposition.TruncatedSVD(4000, "randomized")
svd.fit(bigData)
smallData = svd.transform(bigData)

Maybe this is connected to #401, or maybe I just don't understand the intended use cases of dask? Or am I chunking my data wrong?

Thanks in advance for any help!

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments:7 (5 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented on Jan 25, 2020

The compressed solution can work on non-skinny datasets, yes. If your data does not fit in memory, I recommend using the compute=True option. You will also probably have to be mindful of the chunking of the data so that the matrix multiplies fit nicely in small memory (this is hard). I started writing a blog post on this topic here but never finished: https://github.com/dask/dask-blog/pull/38/files
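
For illustration, a minimal sketch of calling dask.array.linalg.svd_compressed directly with that option (the shape and chunk sizes are assumptions for illustration, not values from the thread):

import dask.array as da

# Illustrative larger-than-memory array; chunks chosen so each block stays small
x = da.random.random((70000, 500000), chunks=(10000, 25000)).astype('float32')

# compute=True reduces memory pressure for larger-than-memory data,
# at the cost of recomputing the input across passes;
# u, s and v are still lazy dask arrays
u, s, v = da.linalg.svd_compressed(x, k=4000, compute=True)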

On Sat, Jan 25, 2020 at 4:21 AM Tom Augspurger notifications@github.com wrote:

I’m not sure.

On Jan 25, 2020, at 06:09, Nik notifications@github.com wrote:

Yes, the default algorithm is "tsqr", which is only supported for tall-and-skinny datasets. However, the "randomized" version uses the approximate algorithm from dask.array.linalg.svd_compressed, which is also referenced in the dask examples.

So either svd_compressed does not work with data this big (I will test it), or there is a problem in how TruncatedSVD handles the data?
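
For reference, a minimal sketch contrasting the two code paths (shapes and chunk sizes are illustrative assumptions):

import dask.array as da

# tsqr-based exact SVD: the array must be tall and skinny,
# i.e. chunked only along the first axis
tall = da.random.random((1000000, 100), chunks=(10000, 100))
u, s, v = da.linalg.svd(tall)

# compressed (randomized) SVD: works on general, non-skinny shapes
wide = da.random.random((70000, 500000), chunks=(10000, 25000))
u2, s2, v2 = da.linalg.svd_compressed(wide, k=4000)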


0 reactions
mrocklin commented on Jan 29, 2020

(I’m glad that things are working for you)

On Wed, Jan 29, 2020 at 9:45 AM Matthew Rocklin mrocklin@gmail.com wrote:

FYI, auto rechunking is unlikely to give optimal performance here. I encourage you to play around a bit.

On Tue, Jan 28, 2020 at 4:54 PM Nik notifications@github.com wrote:

Thank you @mrocklin https://github.com/mrocklin, that was exactly the information I needed. With the compute=True option, dask.array.linalg.svd_compressed works on this huge array. It does take a while, but that is not a problem. Regarding the chunking, I could still use the rechunk('auto') option together with dask.config.set({'array.chunk-size': '1GiB'}), so no manual chunk-size specification was needed.
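
A minimal sketch of that chunking setup, assuming the same illustrative array shape as in the examples above:

import dask
import dask.array as da
import numpy as np

# Raise the target chunk size so rechunk('auto') produces ~1 GiB blocks
# (fewer, larger tasks) without specifying chunk sizes by hand
dask.config.set({'array.chunk-size': '1GiB'})
x = da.random.random((70000, 500000)).astype(np.float32).rechunk('auto')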

The only remaining issue is that the TruncatedSVD wrapper does not work, because the compute option cannot be passed through. But for me this is not so important.
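
A minimal sketch of working around the wrapper by calling svd_compressed directly; the final projection step is an assumption about how to reproduce transform, not something stated in the thread:

import dask
import dask.array as da
import numpy as np

# Same illustrative array as in the sketch above
dask.config.set({'array.chunk-size': '1GiB'})
x = da.random.random((70000, 500000)).astype(np.float32).rechunk('auto')

# Call svd_compressed directly, since the wrapper does not expose compute=
u, s, v = da.linalg.svd_compressed(x, k=4000, compute=True)

# Project onto the right singular vectors by hand; analogous to
# TruncatedSVD.transform(x), up to the sign of the singular vectors
x_reduced = x.dot(v.T)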

Thanks again for the quick help.

