IncrementalPCA fails if data size % batch size < n_components


Description

IncrementalPCA raises a ValueError with the message: n_components=%r must be less or equal to the batch number of samples %d

The error occurs because the last batch generated by utils.gen_batches may be smaller than batch_size, and therefore smaller than n_components.
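To make the batch arithmetic concrete, here is a minimal sketch of how 101 samples split into batches of 10. The helper batch_slices is a hypothetical stand-in for the slicing that gen_batches performs, written out so the example needs no third-party imports; it is not the library's code.

```python
def batch_slices(n, batch_size):
    """Yield consecutive slices covering range(n), batch_size items each.

    Hypothetical stand-in for the batch slicing described in this issue;
    the final slice is shorter whenever n is not a multiple of batch_size.
    """
    for start in range(0, n, batch_size):
        yield slice(start, min(start + batch_size, n))


n_samples, batch_size, n_components = 101, 10, 2

# Size of every batch that would be handed to partial_fit.
sizes = [s.stop - s.start for s in batch_slices(n_samples, batch_size)]
print(sizes)

# Batches with fewer samples than n_components are the ones that trigger the error.
too_small = [size for size in sizes if size < n_components]
print(too_small)
```

With these numbers the only undersized batch is the final one, which holds the single leftover sample.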

Steps/Code to Reproduce

from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA

iris = load_iris()
X = iris.data[:101]  # 101 samples: ten batches of 10 plus a final batch of 1
ipca = IncrementalPCA(n_components=2, batch_size=10)
X_ipca = ipca.fit_transform(X)  # raises ValueError on the final batch

I reduced the iris data to 101 instances, so the last batch has only a single data instance, which is less than the number of components.

As far as I can see, none of the current unit tests run into this. (test_incremental_pca_batch_signs could, if the code that raises the exception compared self.n_components_ with n_samples, which it should but doesn’t.)

Skipping the last batch if it is too small, that is, changing

        for batch in gen_batches(n_samples, self.batch_size_):
            self.partial_fit(X[batch], check_input=False)

to

        for batch in gen_batches(n_samples, self.batch_size_):
            if self.n_components is None \
                    or X[batch].shape[0] >= self.n_components:
                self.partial_fit(X[batch], check_input=False)

fixes the problem. @kastnerkyle, please confirm that this solution seems OK before I go preparing the PR and tests.
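One consequence of the skip worth checking before a PR: samples in an undersized batch never reach partial_fit. The sketch below counts how many samples would still be fitted under the proposed condition. Both batch_slices and kept_batches are hypothetical helpers written for illustration, not part of scikit-learn.

```python
def batch_slices(n, batch_size):
    """Yield consecutive slices covering range(n) (hypothetical helper)."""
    for start in range(0, n, batch_size):
        yield slice(start, min(start + batch_size, n))


def kept_batches(n_samples, batch_size, n_components):
    """Return the slices that pass the size check proposed above."""
    return [
        s
        for s in batch_slices(n_samples, batch_size)
        if n_components is None or (s.stop - s.start) >= n_components
    ]


# Reproduction case from this issue: 101 samples, batches of 10, 2 components.
kept = kept_batches(101, 10, 2)
n_fitted = sum(s.stop - s.start for s in kept)
print(len(kept), n_fitted)

# With n_components=None the check is bypassed and every batch is kept.
kept_all = kept_batches(101, 10, None)
print(len(kept_all))
```

In the reproduction case the last sample is silently left out of the fit, which may or may not be acceptable depending on how much data the tail batch can hold.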

Expected Results

No error is thrown.

Actual Results

ValueError: n_components=2 must be less or equal to the batch number of samples 1.

Versions

Darwin-18.0.0-x86_64-i386-64bit
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.15.2
SciPy 1.1.0
Scikit-Learn 0.20.0

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
jnothman commented, Oct 13, 2018

I don’t know of anyone working on it, but I would consider it a blocker for 0.20.1, as it’s an important regression that should not be hard to fix.

0 reactions
minggli commented, Oct 14, 2018

Hi, if this issue is still open for a fix, I will try to look at it and produce a PR. 👍
