PCA on sparse, noncentered data
I suppose this is more of a feature request than anything else. There are several implementations of PCA that can compute the decomposition on noncentered, sparse data, while the implementation here does not support sparse matrices at all.
A MATLAB implementation can be found here and a Python implementation here. So far, I’ve been using the Python implementation, but it’s missing some things and will eventually be deprecated (https://github.com/facebook/fbpca/pull/9).
I haven’t looked at the code or the math too closely, but as far as I’m aware, it’s just a matter of adding a term to `randomized_range_finder` to account for centering.

Is this something that you guys are aware of, and is anyone working on this? This would be an awesome feature to have.
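To make the idea a bit more concrete, here is a minimal, non-randomized sketch of the trick such implementations rely on: since the covariance of the centered data equals (XᵀX)/n − μμᵀ (with μ the vector of column means), PCA can be assembled from the sparse Gram matrix and the means alone, without ever forming the dense centered matrix. This is only my own illustration (the function name and normalization are mine, not fbpca’s), and it is only feasible when the feature dimension is small enough for a dense eigendecomposition:

```python
import numpy as np
import scipy.sparse as sp

def pca_sparse_naive(X, n_components):
    """PCA of a sparse matrix without materializing the dense centered data.

    Uses cov = (X^T X) / n - mu mu^T, where mu holds the column means, so only
    the sparse Gram matrix and a small d x d dense covariance are ever formed.
    Practical only when the number of features d is moderate.
    """
    n = X.shape[0]
    mu = np.asarray(X.mean(axis=0)).ravel()             # column means, dense (d,)
    cov = (X.T @ X).toarray() / n - np.outer(mu, mu)    # covariance of the *centered* data
    evals, evecs = np.linalg.eigh(cov)                  # ascending eigenvalues
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order].T, evals[order]              # components (k, d), explained variances

X = sp.random(10_000, 300, density=0.01, format="csr", random_state=0)
components, variances = pca_sparse_naive(X, n_components=5)

# Projection also never densifies X: (X - 1 mu^T) W^T = X W^T - 1 (mu W^T)
mu = np.asarray(X.mean(axis=0)).ravel()
scores = X @ components.T - mu @ components.T
print(scores.shape)   # (10000, 5)
```

The randomized implementations avoid even the d × d covariance, but the centering term they add plays the same role as the `np.outer(mu, mu)` correction above.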
Top GitHub Comments
I recently ran into needing this, and would like to resurrect the implementation by taking over https://github.com/scikit-learn/scikit-learn/pull/18689, or by staying close to the same approach of using a `LinearOperator` as a black-box input to ARPACK/LOBPCG/PROPACK/RandomizedSVD, which @lobpcg suggested and for which @atarashansky implemented a POC using @pavlin-policar’s implicit centering trick.

Here’s a demo showing that at least the three solvers available through `scipy.sparse.linalg.svds` are compatible with `LinearOperator` (`svds` transforms its inputs into one anyway, but it’s still reassuring to see 😃). Skip to the table at the bottom: https://gist.github.com/andportnoy/03c70436a8b830f90e99ab22640057fb
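For reference, the implicit centering trick amounts to wrapping the sparse `X` in a `LinearOperator` that acts like the dense, column-centered matrix. The sketch below is my own condensed version of that idea (not the code from the gist, the PR, or @pavlin-policar’s original), just to show its shape:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

def implicitly_centered(X):
    """A LinearOperator that acts like the dense X - 1 @ mu.T (column-centered X)
    while only ever touching the sparse X and the dense mean vector mu."""
    n, d = X.shape
    mu = np.asarray(X.mean(axis=0)).ravel()
    return LinearOperator(
        shape=(n, d),
        matvec=lambda v: X @ v - mu @ v,            # (X - 1 mu^T) v;  mu @ v is a scalar
        rmatvec=lambda v: X.T @ v - mu * v.sum(),   # (X - 1 mu^T)^T v
        matmat=lambda V: X @ V - np.outer(np.ones(n), mu @ V),
        dtype=np.float64,
    )

X = sp.random(5_000, 200, density=0.01, format="csr", random_state=0)
U, S, Vt = svds(implicitly_centered(X), k=5)        # PCA of X, X never densified

# Small-scale sanity check against the dense, explicitly centered SVD
Xd = X.toarray()
S_ref = np.linalg.svd(Xd - Xd.mean(axis=0), compute_uv=False)
print(np.allclose(np.sort(S), np.sort(S_ref[:5])))  # True; sorted because svds does not fix the order
```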
Perhaps I’ve been unclear. I’ll try to be more clear this time 😃
**What we know**

We know that SVD is not PCA; they are different things. But we usually compute PCA by taking a matrix X, centering it, then applying SVD. This is equivalent to taking X, centering it, computing its covariance matrix C, then computing the eigenvalues and eigenvectors of C.

So this is nothing new: we typically compute PCA via the SVD, and when X is centered, the SVD is the same as PCA. Otherwise they are different.
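A quick dense sanity check of that equivalence (purely illustrative, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)                      # centered data

# Route 1: SVD of the centered matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Route 2: eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (len(X) - 1)
evals, evecs = np.linalg.eigh(C)
evals, evecs = evals[::-1], evecs[:, ::-1]   # descending order

print(np.allclose(S**2 / (len(X) - 1), evals))     # identical explained variances
print(np.allclose(np.abs(Vt), np.abs(evecs.T)))    # identical components, up to sign
```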
**The problem**

Sometimes we can’t center our matrix. If we have a huge sparse matrix, centering it destroys the sparsity and makes the matrix dense. When X is really, really big, that is often impossible: the dense matrix simply can’t fit into memory, so we can’t do PCA. There are incremental ways around this (I believe incremental PCA is used for this), but that requires a lot of disk reads, so it’s slow.
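To put rough numbers on the memory point above (the sizes here are made up, but in a realistic ballpark for large count matrices):

```python
# Hypothetical 1,000,000 x 100,000 matrix with 0.1% nonzero entries.
n, d, density = 1_000_000, 100_000, 0.001
dense_gb = n * d * 8 / 1e9                     # float64, after centering makes it dense
sparse_gb = n * d * density * (8 + 4) / 1e9    # CSR: 8-byte values + 4-byte column indices
print(f"dense: {dense_gb:.0f} GB, sparse: ~{sparse_gb:.1f} GB")   # dense: 800 GB, sparse: ~1.2 GB
```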
We can still do SVD, but since X is not centered, it is not the same as PCA. PCA is often nicer because it’s easier to interpret; it’s pretty hard to interpret an uncentered SVD (or I just might not be aware of how to).
**The solution**

The implementations I referenced (I don’t know which paper the formulation comes from; I haven’t gone through the maths) provide both randomized SVD and randomized PCA, and the PCA is computed with a randomized method that never needs to center X. To emphasise this: it is possible to compute actual PCA without ever having to center the original, potentially huge, X matrix.

To achieve this, we take the already-implemented randomized SVD algorithm and add a couple of negative terms, which account for centering. The changes to the algorithm are very minor.
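I haven’t verified this against either of the implementations I linked, so take the following only as a sketch of the kind of change I mean: wherever the range finder (and the final projection) multiplies by X, a rank-one correction involving the column means μ is subtracted, so the iteration behaves exactly as if it were run on the dense X − 1μᵀ. The function below is my own illustration, not scikit-learn’s `randomized_range_finder`:

```python
import numpy as np
import scipy.sparse as sp

def centered_range_finder(X, size, n_iter, random_state=0):
    """Orthonormal basis approximating the range of A = X - 1 mu^T,
    computed without ever forming A. Illustration only."""
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    mu = np.asarray(X.mean(axis=0)).ravel()
    ones = np.ones(n)

    Omega = rng.normal(size=(d, size))
    Q, _ = np.linalg.qr(X @ Omega - np.outer(ones, mu @ Omega))   # A Omega = X Omega - 1 (mu^T Omega)
    for _ in range(n_iter):                                       # power iterations, same correction
        Z, _ = np.linalg.qr(X.T @ Q - np.outer(mu, ones @ Q))     # A^T Q = X^T Q - mu (1^T Q)
        Q, _ = np.linalg.qr(X @ Z - np.outer(ones, mu @ Z))       # A Z
    return Q

X = sp.random(20_000, 1_000, density=0.001, format="csr", random_state=0)
n, d = X.shape
mu = np.asarray(X.mean(axis=0)).ravel()

Q = centered_range_finder(X, size=15, n_iter=4)
B = (X.T @ Q).T - np.outer(Q.T @ np.ones(n), mu)     # Q^T A, a small (15, d) dense matrix
Uhat, S, Vt = np.linalg.svd(B, full_matrices=False)
U = Q @ Uhat                                         # left singular vectors of the centered X
print(S[:5])                                         # leading PCA singular values
```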
The randomized SVD implemented here is great and could be extended with just a couple of extra terms so that it computes both the SVD and PCA. A single flag, e.g. `apply_centering`, passed to `randomized_svd`, could switch between the two.

**Conclusion**

I hope I’ve made this clear this time. I also hope it’s clear that having this in scikit-learn would be very beneficial. Again, this is a more efficient way to compute PCA on sparse matrices, one that doesn’t require making them dense. PCA is already implemented in scikit-learn, so adding an implementation that supports sparse data seems like the natural next step. This is not the same as SVD: PCA is nicer than SVD because it has a clearer interpretation.
I could work on this if needed, but I am not at all familiar with the codebase, and have fairly limited time.