SparsePCA incorrectly scales results in .transform()
Description
When using SparsePCA, the transform() method incorrectly scales the results based on the number of rows in the data matrix passed.
Proposed Fix
I am regrettably unable to open a pull request from where I sit. The issue is with this chunk of code, at line 179 of sparse_pca.py as of this writing:
U = ridge_regression(self.components_.T, X.T, ridge_alpha,
                     solver='cholesky')
s = np.sqrt((U ** 2).sum(axis=0))
s[s == 0] = 1
U /= s
return U
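To see why this depends on the number of rows: every row of U contributes an O(1) term to the squared column sums, so s grows roughly like the square root of the number of samples, and dividing by it shrinks the output for larger matrices. A toy illustration (my own snippet, not library code):

import numpy as np

rng = np.random.RandomState(0)
for n in (10, 100, 1000):
    U = rng.randn(n, 3)                # stand-in for the ridge solution U
    s = np.sqrt((U ** 2).sum(axis=0))  # the scale computed in transform()
    print(n, s.round(1))               # grows roughly like sqrt(n)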
I honestly do not understand the details of the chosen implementation of SparsePCA. Depending on the objectives of the class, making the transformed features meaningful for unseen examples requires one of two modifications: either (a) learn the scale factor s from the training data (i.e., make it an instance attribute such as .scale_factor_), or (b) use .mean(axis=0) instead of .sum(axis=0) to remove the number-of-examples dependency. A sketch of option (b) is given below.
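As a minimal illustration of option (b), assuming the rest of the transform() body is left untouched (this sketches the proposed change, not a reviewed or merged patch):

U = ridge_regression(self.components_.T, X.T, ridge_alpha,
                     solver='cholesky')
# Mean over rows instead of sum, so the scale no longer grows with n_samples.
s = np.sqrt((U ** 2).mean(axis=0))
s[s == 0] = 1
U /= s
return U

Option (a) would instead compute s once in .fit() on the training data, store it as an attribute (the .scale_factor_ name above is only a suggestion), and divide by it in .transform(); that also keeps the mapping consistent across batches of different sizes.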
Steps/Code to Reproduce
from sklearn.decomposition import SparsePCA
import numpy as np

def get_data(count, seed):
    np.random.seed(seed)
    col1 = np.random.random(count)
    col2 = np.random.random(count)
    data = np.hstack([a[:, np.newaxis] for a in [
        col1 + .01*np.random.random(count),
        -col1 + .01*np.random.random(count),
        2*col1 + col2 + .01*np.random.random(count),
        col2 + .01*np.random.random(count),
    ]])
    return data
train = get_data(1000, 1)
spca = SparsePCA(max_iter=20)
results_train = spca.fit_transform(train)
test = get_data(10, 1)
results_test = spca.transform(test)

print("Training statistics:")
print(" mean: %12.3f" % results_train.mean())
print(" max:  %12.3f" % results_train.max())
print(" min:  %12.3f" % results_train.min())
print("Testing statistics:")
print(" mean: %12.3f" % results_test.mean())
print(" max:  %12.3f" % results_test.max())
print(" min:  %12.3f" % results_test.min())
Output:
Training statistics:
 mean:       -0.009
 max:         0.067
 min:        -0.080
Testing statistics:
 mean:       -0.107
 max:         0.260
 min:        -0.607
Expected Results
The test-set min/max values are on the same scale as the training results.
Actual Results
The test-set min/max values are much larger than the training results because fewer examples were used. It is trivial to repeat this process with various sizes of training and testing data to see the relationship; a quick check is sketched below.
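For example, reusing get_data and the fitted spca from the reproduction script above (illustrative only): under the sum-based scaling, the spread of transform(X) shrinks roughly like 1/sqrt(n_rows).

for n in (10, 100, 1000):
    chunk = get_data(n, 2)
    print("n=%4d  std=%8.3f" % (n, spca.transform(chunk).std()))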
Top GitHub Comments
This seems like a bug indeed, although I don’t know much about SparsePCA. I quickly looked at it, and PCA does not give the same results as SparsePCA with alpha=0 and ridge_alpha=0 (whether they should or not, I am not sure). One thing I noticed is that spca.components_ are not normalized (contrary to pca.components_), and their norms seem to depend on the size of the data as well, so maybe there was some logic behind the transform scaling…

Indeed, the result with no regularization will only match PCA up to rotation and scaling, because the two models are constrained differently (our SPCA does not have orthonormal constraints). I suspect they should have equal reconstruction MSE, right?
According to this discussion, I agree the current transform is indeed wrong. As far as I understand, the proposed fix is about .fit, not .transform. When you say “leave things untouched”, do you mean other than fixing the .transform bug? If not, with scale_components=True, will transform be normalized twice?

(Disclaimer: I’m missing a lot of context and forgot everything about the code.)