
.fit.transform != .fit_transform inconsistency in PCA results

See original GitHub issue

PCA’s fit_transform returns different results than the application of fit and transform methods individually. A piece of code that shows the inconsistency is given below.

import numpy as np
from sklearn.decomposition import PCA

nn = np.array([[0,1,2],[3,4,5],[6,7,8]])
pca = PCA(n_components=2, random_state=42)
print(pca.fit_transform(nn))

nn = np.array([[0,1,2],[3,4,5],[6,7,8]])
pca = PCA(n_components=2, random_state=42)
pca.fit(nn)
print(pca.transform(nn))

Please run the code with scikit-learn 0.23.2.
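As a quick sketch (editor's addition, not part of the original report), the two outputs above can be compared numerically instead of by eye. With the default solver on this tiny matrix, any discrepancy is at the level of floating-point rounding:

```python
import numpy as np
from sklearn.decomposition import PCA

nn = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=float)

pca = PCA(n_components=2, random_state=42)
ft = pca.fit_transform(nn)
t = pca.transform(nn)

# Any discrepancy here is at the level of floating-point rounding.
print(np.abs(ft - t).max())
```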

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 11 (8 by maintainers)

Top GitHub Comments

1 reaction
glemaitre commented, Dec 2, 2020

I looked quickly and it seems that we don’t do exactly the same low-level operations:

fit_transform: U *= S[:self.n_components_], where U was computed by linalg.svd during fit and we reuse it since that should be more efficient. Recall that U S Vt = X, that Vt holds the components, and that U S is then the scores.

transform: np.dot(X, self.components_.T)

I would not be surprised if we have small floating-point errors depending on the multiplication order, etc.
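The two low-level paths described above can be sketched directly with NumPy (an editor's illustrative reconstruction, not scikit-learn's actual code; the matrix sizes and `k` are arbitrary). Both expressions are mathematically the same projection, but they need not match bit for bit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Xc = X - X.mean(axis=0)                    # PCA centers the data first

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2

scores_fit_transform = U[:, :k] * S[:k]    # fit_transform path: U * S
scores_transform = Xc @ Vt[:k].T           # transform path: X @ components_.T

# Mathematically identical (U S Vt = X  =>  X @ Vt.T = U S), but the
# floating-point results can differ in the last bits.
print(np.abs(scores_fit_transform - scores_transform).max())
```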

0 reactions
illuminascent commented, Jun 29, 2022

Here to report that this bug still persists in 1.1.1. I have also tested versions 0.23.2 and 1.0.1; they all produce the same inconsistent results.

The following is a minimal example to reproduce:

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(22)
data = np.random.random([3000, 240])
pca = PCA(n_components=8)

ft = pca.fit_transform(data)
t = pca.transform(data)

diff = ft - t
rel_diff = diff / t

print(np.isclose(ft, t).mean())
#> 0.000125 
print(diff.min(), diff.max())
#> -0.10248672925028807 0.09267144836984448
print(rel_diff.min(), rel_diff.max())
#> -739.4328987087409 3968.5381008989184

Such a difference is not acceptable, and I have actually seen errors up to an order of 10 in production data. Adding svd_solver='arpack' solves this inconsistency completely, as stated by @ogrisel. However, I would argue that needing a specific keyword argument just for PCA to behave consistently is not appropriate, since people will likely expect fit+transform == fit_transform. Would you consider switching the default solver in PCA?
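As a sketch of the workaround mentioned above (editor's addition; the agreement claim is taken from the thread, and `random_state=0` is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(22)
data = np.random.random([3000, 240])

# Per the thread, the arpack solver makes fit_transform and fit+transform
# agree, since it computes an accurate truncated SVD rather than the
# default randomized approximation used for data of this shape.
pca = PCA(n_components=8, svd_solver='arpack', random_state=0)
ft = pca.fit_transform(data)
t = pca.transform(data)
print(np.allclose(ft, t))
```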
