IncrementalPCA fails if data size % batch size < n_components
Description
`IncrementalPCA` throws `n_components=%r must be less or equal to the batch number of samples %d`. The error occurs because the last batch generated by `utils.gen_batches` may be smaller than `batch_size`.
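The undersized final slice is easy to see directly; a minimal sketch using `sklearn.utils.gen_batches`:

```python
from sklearn.utils import gen_batches

# 101 samples in batches of 10: the final slice covers a single sample.
batches = list(gen_batches(101, 10))
print(batches[-1])  # slice(100, 101, None)
```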
Steps/Code to Reproduce
```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, IncrementalPCA

iris = load_iris()
X = iris.data[:101]
ipca = IncrementalPCA(n_components=2, batch_size=10)
X_ipca = ipca.fit_transform(X)
```
I reduced the iris data to 101 instances, so the last batch has only a single data instance, which is less than the number of components.
As far as I can see, none of the current unit tests run into this. (`test_incremental_pca_batch_signs` could, if the code that raises the exception compared `self.n_components_` with `n_samples`, which it should, but doesn't.)
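A hypothetical regression test for this case might look like the following (the test name and setup are my own, not from the existing suite):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def test_incremental_pca_partial_batch():
    # 101 samples with batch_size=10 leaves a final batch of 1 sample,
    # which is smaller than n_components=2; fit should still succeed.
    rng = np.random.RandomState(0)
    X = rng.randn(101, 4)
    ipca = IncrementalPCA(n_components=2, batch_size=10)
    ipca.fit(X)  # should not raise ValueError
```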
Skipping the last batch if it is too small, that is, changing
```python
for batch in gen_batches(n_samples, self.batch_size_):
    self.partial_fit(X[batch], check_input=False)
```
to
```python
for batch in gen_batches(n_samples, self.batch_size_):
    # Skip a batch that has fewer samples than n_components.
    if self.n_components is None \
            or X[batch].shape[0] >= self.n_components:
        self.partial_fit(X[batch], check_input=False)
```
fixes the problem. @kastnerkyle, please confirm that this solution seems OK before I prepare the PR and tests.
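An alternative to skipping the undersized batch is to merge it into the previous one. Current `sklearn.utils.gen_batches` accepts a `min_batch_size` keyword that does exactly this; a sketch, assuming the installed version supports that keyword (it is not in 0.20.0):

```python
# Merge an undersized final batch into the previous one instead of skipping it.
# Assumes gen_batches supports the min_batch_size keyword (newer releases).
for batch in gen_batches(n_samples, self.batch_size_,
                         min_batch_size=self.n_components or 0):
    self.partial_fit(X[batch], check_input=False)
```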
Expected Results
No error is thrown.
Actual Results
ValueError: n_components=2 must be less or equal to the batch number of samples 1.
Versions
```
Darwin-18.0.0-x86_64-i386-64bit
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.15.2
SciPy 1.1.0
Scikit-Learn 0.20.0
```
Top GitHub Comments
I don't know of anyone working on it, but I would consider it a blocker for 0.20.1, as it's an important regression that should not be hard to fix.
Hi, if this issue is still open for a fix, I will try to look at it and produce a PR. 👍