SparsePCA incorrectly scales results in .transform()
Description
When using SparsePCA, the transform() method incorrectly scales the results based on the number of rows in the data matrix passed.
Proposed Fix
I am regrettably unable to open a pull request from where I sit. The issue is with this chunk of code, at line 179 of sparse_pca.py as of this writing:
U = ridge_regression(self.components_.T, X.T, ridge_alpha,
                     solver='cholesky')
s = np.sqrt((U ** 2).sum(axis=0))
s[s == 0] = 1
U /= s
return U
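To see why this depends on the number of rows: every row of U contributes an O(1) term to the squared column sums, so s grows roughly like the square root of the number of samples, and dividing by it shrinks the output for larger matrices. A toy illustration (my own snippet, not library code):

import numpy as np

rng = np.random.RandomState(0)
for n in (10, 100, 1000):
    U = rng.randn(n, 3)                # stand-in for the ridge solution U
    s = np.sqrt((U ** 2).sum(axis=0))  # the scale computed in transform()
    print(n, s.round(1))               # grows roughly like sqrt(n)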
I honestly do not understand the details of the chosen implementation of SparsePCA. Depending on the objectives of the class, making the transformed features meaningful for unseen examples requires one of two modifications: either (a) learn the scale factor s from the training data (i.e., make it an instance attribute such as .scale_factor_), or (b) use .mean(axis=0) instead of .sum(axis=0) to remove the number-of-examples dependency. A sketch of option (b) is given below.
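As a minimal illustration of option (b), assuming the rest of the transform() body is left untouched (this sketches the proposed change, not a reviewed or merged patch):

U = ridge_regression(self.components_.T, X.T, ridge_alpha,
                     solver='cholesky')
# Mean over rows instead of sum, so the scale no longer grows with n_samples.
s = np.sqrt((U ** 2).mean(axis=0))
s[s == 0] = 1
U /= s
return U

Option (a) would instead compute s once in .fit() on the training data, store it as an attribute (the .scale_factor_ name above is only a suggestion), and divide by it in .transform(); that also keeps the mapping consistent across batches of different sizes.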
Steps/Code to Reproduce
from sklearn.decomposition import SparsePCA
import numpy as np

def get_data(count, seed):
    np.random.seed(seed)
    col1 = np.random.random(count)
    col2 = np.random.random(count)
    data = np.hstack([a[:, np.newaxis] for a in [
        col1 + .01*np.random.random(count),
        -col1 + .01*np.random.random(count),
        2*col1 + col2 + .01*np.random.random(count),
        col2 + .01*np.random.random(count),
    ]])
    return data
train = get_data(1000, 1)
spca = SparsePCA(max_iter=20)
results_train = spca.fit_transform(train)
test = get_data(10, 1)
results_test = spca.transform(test)

print("Training statistics:")
print(" mean: %12.3f" % results_train.mean())
print(" max:  %12.3f" % results_train.max())
print(" min:  %12.3f" % results_train.min())
print("Testing statistics:")
print(" mean: %12.3f" % results_test.mean())
print(" max:  %12.3f" % results_test.max())
print(" min:  %12.3f" % results_test.min())
Output:
Training statistics:
 mean:       -0.009
 max:         0.067
 min:        -0.080
Testing statistics:
 mean:       -0.107
 max:         0.260
 min:        -0.607
Expected Results
The test-set min/max values are on the same scale as the training results.
Actual Results
The test-set min/max values are much larger than the training results because fewer examples were used. It is trivial to repeat this process with various sizes of training and testing data to see the relationship; a quick check is sketched below.
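For example, reusing get_data and the fitted spca from the reproduction script above (illustrative only): under the sum-based scaling, the spread of transform(X) shrinks roughly like 1/sqrt(n_rows).

for n in (10, 100, 1000):
    chunk = get_data(n, 2)
    print("n=%4d  std=%8.3f" % (n, spca.transform(chunk).std()))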
Top GitHub Comments
This seems like a bug indeed, although I don’t know much about SparsePCA. I quickly looked at it, and PCA does not give the same results as SparsePCA with alpha=0 and ridge_alpha=0 (whether they should or not, I am not sure). One thing I noticed is that spca.components_ are not normalized (contrary to pca.components_), and their norms seem to depend on the size of the data as well, so maybe there was some logic behind the transform scaling…

Indeed, the result with no regularization will only match PCA up to rotation and scaling, because the two models are constrained differently (our SPCA does not have orthonormal constraints). I suspect they should have equal reconstruction MSE, right?
According to this discussion, I agree the current transform is indeed wrong. As far as I understand, the proposed fix is about .fit, not .transform. When you say “leave things untouched”, do you mean other than fixing the .transform bug? If not, with scale_components=True, will transform be normalized twice?

(Disclaimer: I’m missing a lot of context and forgot everything about the code.)