IncrementalPCA fails if data size % batch size < n_components
Description
`IncrementalPCA` throws `n_components=%r must be less or equal to the batch number of samples %d`. The error occurs because the last batch generated by `utils.gen_batches` may be smaller than `batch_size`.
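The undersized final slice is easy to see directly; a minimal sketch using `sklearn.utils.gen_batches`:

```python
from sklearn.utils import gen_batches

# 101 samples in batches of 10: the final slice covers a single sample.
batches = list(gen_batches(101, 10))
print(batches[-1])  # slice(100, 101, None)
```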
Steps/Code to Reproduce
```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, IncrementalPCA

iris = load_iris()
X = iris.data[:101]
ipca = IncrementalPCA(n_components=2, batch_size=10)
X_ipca = ipca.fit_transform(X)
```
I reduced the iris data to 101 instances, so the last batch has only a single data instance, which is less than the number of components.
As far as I can see, none of the current unit tests run into this. (`test_incremental_pca_batch_signs` could, if the code that raises the exception compared `self.n_components_` with `n_samples`, which it should, but doesn't.)
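A hypothetical regression test for this case might look like the following (the test name and setup are my own, not from the existing suite):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def test_incremental_pca_partial_batch():
    # 101 samples with batch_size=10 leaves a final batch of 1 sample,
    # which is smaller than n_components=2; fit should still succeed.
    rng = np.random.RandomState(0)
    X = rng.randn(101, 4)
    ipca = IncrementalPCA(n_components=2, batch_size=10)
    ipca.fit(X)  # should not raise ValueError
```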
Skipping the last batch if it is too small, that is, changing
```python
for batch in gen_batches(n_samples, self.batch_size_):
    self.partial_fit(X[batch], check_input=False)
```
to
```python
for batch in gen_batches(n_samples, self.batch_size_):
    # Skip a batch that has fewer samples than n_components.
    if self.n_components is None \
            or X[batch].shape[0] >= self.n_components:
        self.partial_fit(X[batch], check_input=False)
```
fixes the problem. @kastnerkyle, please confirm that this solution seems OK before I prepare the PR and tests.
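An alternative to skipping the undersized batch is to merge it into the previous one. Current `sklearn.utils.gen_batches` accepts a `min_batch_size` keyword that does exactly this; a sketch, assuming the installed version supports that keyword (it is not in 0.20.0):

```python
# Merge an undersized final batch into the previous one instead of skipping it.
# Assumes gen_batches supports the min_batch_size keyword (newer releases).
for batch in gen_batches(n_samples, self.batch_size_,
                         min_batch_size=self.n_components or 0):
    self.partial_fit(X[batch], check_input=False)
```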
Expected Results
No error is thrown.
Actual Results
ValueError: n_components=2 must be less or equal to the batch number of samples 1.
Versions
```
Darwin-18.0.0-x86_64-i386-64bit
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.15.2
SciPy 1.1.0
Scikit-Learn 0.20.0
```
Top GitHub Comments
I don't know of anyone working on it, but I would consider it a blocker for 0.20.1, as it's an important regression that should not be hard to fix.
Hi, if this issue is still open for a fix, I will try to look at it and produce a PR. 👍