PCA on sparse, noncentered data
I suppose this is more of a feature request than anything else. There are several implementations of PCA that can compute the decomposition on noncentered, sparse data, while the implementation here does not support sparse matrices at all.
A MATLAB implementation can be found here and a Python implementation here. So far, I’ve been using the Python implementation, but it’s missing some things and will eventually be deprecated (https://github.com/facebook/fbpca/pull/9).
I haven’t looked at the code or the math too closely, but as far as I’m aware, it’s just a matter of adding a term to `randomized_range_finder` to account for centering.

Is this something that you guys are aware of, and is anyone working on this? This would be an awesome feature to have.
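To make the idea a bit more concrete, here is a minimal, non-randomized sketch of the trick such implementations rely on: since the covariance of the centered data equals (XᵀX)/n − μμᵀ (with μ the vector of column means), PCA can be assembled from the sparse Gram matrix and the means alone, without ever forming the dense centered matrix. This is only my own illustration (the function name and normalization are mine, not fbpca’s), and it is only feasible when the feature dimension is small enough for a dense eigendecomposition:

```python
import numpy as np
import scipy.sparse as sp

def pca_sparse_naive(X, n_components):
    """PCA of a sparse matrix without materializing the dense centered data.

    Uses cov = (X^T X) / n - mu mu^T, where mu holds the column means, so only
    the sparse Gram matrix and a small d x d dense covariance are ever formed.
    Practical only when the number of features d is moderate.
    """
    n = X.shape[0]
    mu = np.asarray(X.mean(axis=0)).ravel()             # column means, dense (d,)
    cov = (X.T @ X).toarray() / n - np.outer(mu, mu)    # covariance of the *centered* data
    evals, evecs = np.linalg.eigh(cov)                  # ascending eigenvalues
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order].T, evals[order]              # components (k, d), explained variances

X = sp.random(10_000, 300, density=0.01, format="csr", random_state=0)
components, variances = pca_sparse_naive(X, n_components=5)

# Projection also never densifies X: (X - 1 mu^T) W^T = X W^T - 1 (mu W^T)
mu = np.asarray(X.mean(axis=0)).ravel()
scores = X @ components.T - mu @ components.T
print(scores.shape)   # (10000, 5)
```

The randomized implementations avoid even the d × d covariance, but the centering term they add plays the same role as the `np.outer(mu, mu)` correction above.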
Top GitHub Comments
I recently ran into needing this, and would like to resurrect the implementation by taking over https://github.com/scikit-learn/scikit-learn/pull/18689, or by staying close to the same approach of using a `LinearOperator` as a black-box input to ARPACK/LOBPCG/PROPACK/RandomizedSVD, which @lobpcg suggested and for which @atarashansky implemented a POC using @pavlin-policar’s implicit centering trick.

Here’s a demo showing that at least the three solvers available through `scipy.sparse.linalg.svds` are compatible with `LinearOperator` (`svds` transforms its inputs into one anyway, but it’s still reassuring to see 😃). Skip to the table at the bottom: https://gist.github.com/andportnoy/03c70436a8b830f90e99ab22640057fb
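For reference, the implicit centering trick amounts to wrapping the sparse `X` in a `LinearOperator` that acts like the dense, column-centered matrix. The sketch below is my own condensed version of that idea (not the code from the gist, the PR, or @pavlin-policar’s original), just to show its shape:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

def implicitly_centered(X):
    """A LinearOperator that acts like the dense X - 1 @ mu.T (column-centered X)
    while only ever touching the sparse X and the dense mean vector mu."""
    n, d = X.shape
    mu = np.asarray(X.mean(axis=0)).ravel()
    return LinearOperator(
        shape=(n, d),
        matvec=lambda v: X @ v - mu @ v,            # (X - 1 mu^T) v;  mu @ v is a scalar
        rmatvec=lambda v: X.T @ v - mu * v.sum(),   # (X - 1 mu^T)^T v
        matmat=lambda V: X @ V - np.outer(np.ones(n), mu @ V),
        dtype=np.float64,
    )

X = sp.random(5_000, 200, density=0.01, format="csr", random_state=0)
U, S, Vt = svds(implicitly_centered(X), k=5)        # PCA of X, X never densified

# Small-scale sanity check against the dense, explicitly centered SVD
Xd = X.toarray()
S_ref = np.linalg.svd(Xd - Xd.mean(axis=0), compute_uv=False)
print(np.allclose(np.sort(S), np.sort(S_ref[:5])))  # True; sorted because svds does not fix the order
```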
Perhaps I’ve been unclear. I’ll try to be more clear this time 😃
**What we know**

We know that SVD is not PCA; they are different things. But we usually compute PCA by taking a matrix X, centering it, then applying SVD. This is equivalent to taking X, centering it, computing its covariance matrix C, then computing the eigenvalues and eigenvectors of C.

So this is nothing new: we typically compute PCA via the SVD, and when X is centered, the SVD is the same as PCA. Otherwise they are different.
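A quick dense sanity check of that equivalence (purely illustrative, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)                      # centered data

# Route 1: SVD of the centered matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Route 2: eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (len(X) - 1)
evals, evecs = np.linalg.eigh(C)
evals, evecs = evals[::-1], evecs[:, ::-1]   # descending order

print(np.allclose(S**2 / (len(X) - 1), evals))     # identical explained variances
print(np.allclose(np.abs(Vt), np.abs(evecs.T)))    # identical components, up to sign
```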
**The problem**

Sometimes we can’t center our matrix. If we have a huge sparse matrix, centering it destroys the sparsity and makes the matrix dense. When X is really, really big, that is often impossible: the dense matrix simply can’t fit into memory, so we can’t do PCA. There are incremental ways around this (I believe incremental PCA is used for this), but that requires a lot of disk reads, so it’s slow.
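To put rough numbers on the memory point above (the sizes here are made up, but in a realistic ballpark for large count matrices):

```python
# Hypothetical 1,000,000 x 100,000 matrix with 0.1% nonzero entries.
n, d, density = 1_000_000, 100_000, 0.001
dense_gb = n * d * 8 / 1e9                     # float64, after centering makes it dense
sparse_gb = n * d * density * (8 + 4) / 1e9    # CSR: 8-byte values + 4-byte column indices
print(f"dense: {dense_gb:.0f} GB, sparse: ~{sparse_gb:.1f} GB")   # dense: 800 GB, sparse: ~1.2 GB
```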
We can still do SVD, but since X is not centered, it is not the same as PCA. PCA is often nicer because it’s easier to interpret; it’s pretty hard to interpret an uncentered SVD (or I just might not be aware of how to).
**The solution**

The implementations I referenced (I don’t know which paper the formulation comes from; I haven’t gone through the maths) provide both randomized SVD and randomized PCA, and the PCA is computed with a randomized method that never needs to center X. To emphasise this: it is possible to compute actual PCA without ever having to center the original, potentially huge, X matrix.

To achieve this, we take the already-implemented randomized SVD algorithm and add a couple of negative terms, which account for centering. The changes to the algorithm are very minor.
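I haven’t verified this against either of the implementations I linked, so take the following only as a sketch of the kind of change I mean: wherever the range finder (and the final projection) multiplies by X, a rank-one correction involving the column means μ is subtracted, so the iteration behaves exactly as if it were run on the dense X − 1μᵀ. The function below is my own illustration, not scikit-learn’s `randomized_range_finder`:

```python
import numpy as np
import scipy.sparse as sp

def centered_range_finder(X, size, n_iter, random_state=0):
    """Orthonormal basis approximating the range of A = X - 1 mu^T,
    computed without ever forming A. Illustration only."""
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    mu = np.asarray(X.mean(axis=0)).ravel()
    ones = np.ones(n)

    Omega = rng.normal(size=(d, size))
    Q, _ = np.linalg.qr(X @ Omega - np.outer(ones, mu @ Omega))   # A Omega = X Omega - 1 (mu^T Omega)
    for _ in range(n_iter):                                       # power iterations, same correction
        Z, _ = np.linalg.qr(X.T @ Q - np.outer(mu, ones @ Q))     # A^T Q = X^T Q - mu (1^T Q)
        Q, _ = np.linalg.qr(X @ Z - np.outer(ones, mu @ Z))       # A Z
    return Q

X = sp.random(20_000, 1_000, density=0.001, format="csr", random_state=0)
n, d = X.shape
mu = np.asarray(X.mean(axis=0)).ravel()

Q = centered_range_finder(X, size=15, n_iter=4)
B = (X.T @ Q).T - np.outer(Q.T @ np.ones(n), mu)     # Q^T A, a small (15, d) dense matrix
Uhat, S, Vt = np.linalg.svd(B, full_matrices=False)
U = Q @ Uhat                                         # left singular vectors of the centered X
print(S[:5])                                         # leading PCA singular values
```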
The randomized SVD implemented here is great and could be extended with just a couple of extra terms so that it computes both the SVD and PCA. A single flag, e.g. `apply_centering`, passed to `randomized_svd`, could switch between the two.

**Conclusion**

I hope I’ve made this clear this time. I also hope it’s clear that having this in scikit-learn would be very beneficial. Again, this is a more efficient way to compute PCA on sparse matrices, one that doesn’t require making them dense. PCA is already implemented in scikit-learn, so adding an implementation that supports sparse data seems like the natural next step. This is not the same as SVD: PCA is nicer than SVD because it has a clearer interpretation.
I could work on this if needed, but I am not at all familiar with the codebase, and have fairly limited time.