.fit.transform != .fit_transform inconsistency in PCA results
PCA's fit_transform returns different results than calling fit and transform separately. The code below shows the inconsistency.
import numpy as np
from sklearn.decomposition import PCA
nn = np.array([[0,1,2],[3,4,5],[6,7,8]])
pca = PCA(n_components=2, random_state=42)
print(pca.fit_transform(nn))
nn = np.array([[0,1,2],[3,4,5],[6,7,8]])
pca = PCA(n_components=2, random_state=42)
pca.fit(nn)
print(pca.transform(nn))
Please run the code with sklearn 0.23.2
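Comparing two printed arrays by eye makes it hard to tell how large the mismatch is. A minimal sketch that puts a number on it, using the same data as above; the names direct, two_step and max_abs_diff and the 1e-8 tolerance are illustrative choices, not part of the original report:

```python
import numpy as np
from sklearn.decomposition import PCA

nn = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

# One-step path: fit and project in a single call.
direct = PCA(n_components=2, random_state=42).fit_transform(nn)

# Two-step path: fit first, then project the same data.
two_step = PCA(n_components=2, random_state=42).fit(nn).transform(nn)

# Largest element-wise discrepancy between the two paths.
max_abs_diff = np.max(np.abs(direct - two_step))
print("max abs difference:", max_abs_diff)

# Tolerance-based comparison (the 1e-8 threshold is an arbitrary choice).
print("equal within tolerance:", np.allclose(direct, two_step, atol=1e-8))
```

If the maximum difference is on the order of machine epsilon, the two outputs differ only by rounding; a large value would point to a different problem.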
Issue Analytics
- State:
- Created 3 years ago
- Comments:11 (8 by maintainers)
Top GitHub Comments
I looked quickly and it seems that we don't do exactly the same low-level operations:
- fit_transform: U *= S[:self.n_components_]. U was computed from the linalg.svd during fit and we reuse it since it should be more efficient. Recall that U S Vt = X, that Vt are the components, and that U S is then the scores.
- transform: np.dot(X, self.components_.T)
I would not be surprised if we had small floating-point errors depending on the multiplication order, etc.
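The two computation paths described in this comment can be reproduced in plain NumPy. This is only a sketch of the idea, not scikit-learn's actual implementation: it centers the data by hand, skips the sign normalization that the library applies, and uses variable names chosen for illustration.

```python
import numpy as np

X = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=float)
Xc = X - X.mean(axis=0)  # center the columns before the SVD

# Thin SVD of the centered data: Xc = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2  # number of components kept (same as n_components=2 in the report)

# Path described for fit_transform: reuse U and scale it by the singular values.
scores_from_us = U[:, :k] * S[:k]

# Path described for transform: project the centered data onto the components.
scores_from_projection = Xc @ Vt[:k].T

# Both are the same quantity mathematically (Xc @ V = U @ diag(S)),
# so any gap comes from floating-point rounding.
print(np.max(np.abs(scores_from_us - scores_from_projection)))
```

Because the rows Vt[:k] are orthonormal, Xc @ Vt[:k].T equals U[:, :k] * S[:k] in exact arithmetic; the printed value shows how far rounding moves the two results apart on this input.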
Here to report that this bug still persists in 1.1.1. I have also tested versions 0.23.2 and 1.0.1; they all produced identical results.
Following is a minimal example to reproduce:
Such a difference is not acceptable, and I actually found errors up to order of 10 in production data. Adding svd_solver='arpack' solves this inconsistency issue completely, as stated by @ogrisel. However, I would like to argue that needing to add a specific keyword argument just for PCA to work is not appropriate, as people would likely expect fit+transform == fit_transform. Would you consider switching the default solver in PCA?
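For anyone who wants to try the workaround, here is a hedged sketch of passing svd_solver='arpack' and measuring the gap between the two call patterns. The random test matrix and the helper name max_gap are made up for illustration; this is not the production data mentioned above, and results on other data or versions may differ.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data for illustration only (not the data from the report).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))

def max_gap(solver):
    """Largest absolute difference between fit_transform and fit + transform."""
    direct = PCA(n_components=2, svd_solver=solver, random_state=42).fit_transform(X)
    two_step = PCA(n_components=2, svd_solver=solver, random_state=42).fit(X).transform(X)
    return np.max(np.abs(direct - two_step))

# 'arpack' requires n_components to be strictly less than min(n_samples, n_features).
for solver in ("full", "arpack"):
    print(solver, max_gap(solver))
```

Running the same helper on the data that shows the large errors would indicate whether the solver choice is really what makes the difference there.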