
.fit.transform != .fit_transform inconsistency in PCA results

See original GitHub issue

PCA’s fit_transform returns different results than the application of fit and transform methods individually. A piece of code that shows the inconsistency is given below.

import numpy as np
from sklearn.decomposition import PCA

nn = np.array([[0,1,2],[3,4,5],[6,7,8]])
pca = PCA(n_components=2, random_state=42)
print(pca.fit_transform(nn))

nn = np.array([[0,1,2],[3,4,5],[6,7,8]])
pca = PCA(n_components=2, random_state=42)
pca.fit(nn)
print(pca.transform(nn))

Please run the code with scikit-learn 0.23.2.
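As a quick sketch (editor's addition, not part of the original report), the two outputs above can be compared numerically instead of by eye. With the default solver on this tiny matrix, any discrepancy is at the level of floating-point rounding:

```python
import numpy as np
from sklearn.decomposition import PCA

nn = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=float)

pca = PCA(n_components=2, random_state=42)
ft = pca.fit_transform(nn)
t = pca.transform(nn)

# Any discrepancy here is at the level of floating-point rounding.
print(np.abs(ft - t).max())
```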

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 11 (8 by maintainers)

Top GitHub Comments

1 reaction
glemaitre commented, Dec 2, 2020

I looked quickly and it seems that we don’t do exactly the same low-level operations:

fit_transform: U *= S[:self.n_components_], where U was computed by linalg.svd during fit and we reuse it since that should be more efficient. Recall that U S Vt = X, that Vt holds the components, and that U S is then the scores.

transform: np.dot(X, self.components_.T)

I would not be surprised if we have small floating-point errors depending on the multiplication order, etc.
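The two low-level paths described above can be sketched directly with NumPy (an editor's illustrative reconstruction, not scikit-learn's actual code; the matrix sizes and `k` are arbitrary). Both expressions are mathematically the same projection, but they need not match bit for bit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Xc = X - X.mean(axis=0)                    # PCA centers the data first

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2

scores_fit_transform = U[:, :k] * S[:k]    # fit_transform path: U * S
scores_transform = Xc @ Vt[:k].T           # transform path: X @ components_.T

# Mathematically identical (U S Vt = X  =>  X @ Vt.T = U S), but the
# floating-point results can differ in the last bits.
print(np.abs(scores_fit_transform - scores_transform).max())
```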

0 reactions
illuminascent commented, Jun 29, 2022

Here to report that this bug still persists in 1.1.1. I have also tested versions 0.23.2 and 1.0.1; they all produce the same inconsistent results.

The following is a minimal example to reproduce:

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(22)
data = np.random.random([3000, 240])
pca = PCA(n_components=8)

ft = pca.fit_transform(data)
t = pca.transform(data)

diff = ft - t
rel_diff = diff / t

print(np.isclose(ft, t).mean())
#> 0.000125 
print(diff.min(), diff.max())
#> -0.10248672925028807 0.09267144836984448
print(rel_diff.min(), rel_diff.max())
#> -739.4328987087409 3968.5381008989184

Such a difference is not acceptable, and I have actually seen errors up to an order of 10 in production data. Adding svd_solver='arpack' solves this inconsistency completely, as stated by @ogrisel. However, I would argue that needing a specific keyword argument just for PCA to behave consistently is not appropriate, since people will likely expect fit+transform == fit_transform. Would you consider switching the default solver in PCA?
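As a sketch of the workaround mentioned above (editor's addition; the agreement claim is taken from the thread, and `random_state=0` is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(22)
data = np.random.random([3000, 240])

# Per the thread, the arpack solver makes fit_transform and fit+transform
# agree, since it computes an accurate truncated SVD rather than the
# default randomized approximation used for data of this shape.
pca = PCA(n_components=8, svd_solver='arpack', random_state=0)
ft = pca.fit_transform(data)
t = pca.transform(data)
print(np.allclose(ft, t))
```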
