Incremental PCA - ValueError: array must not contain infs or NaNs

I’m trying to use IncrementalPCA from sklearn.decomposition. My code couldn’t really be simpler:

from sklearn.decomposition import IncrementalPCA
import pandas as pd

with open('C:/My/File/Path/file.csv', 'r') as fp:
    data = pd.read_csv(fp)

ipca = IncrementalPCA(n_components=4)
ipca.fit(data)

but this is the output when I run it:

C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py:293: RuntimeWarning: overflow encountered in long_scalars
  np.sqrt((self.n_samples_seen_ * n_samples) /
C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py:293: RuntimeWarning: invalid value encountered in sqrt
  np.sqrt((self.n_samples_seen_ * n_samples) /
Traceback (most recent call last):
File "C:/Users/myuser/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/scratch_9.py", line 6, in <module>
  ipca.fit(data)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py", line 215, in fit
  self.partial_fit(X_batch, check_input=False)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\sklearn\decomposition\_incremental_pca.py", line 298, in partial_fit
  U, S, V = linalg.svd(X, full_matrices=False)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\scipy\linalg\decomp_svd.py", line 106, in svd
  a1 = _asarray_validated(a, check_finite=check_finite)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\scipy\_lib\_util.py", line 263, in _asarray_validated
  a = toarray(a)
File "C:\Users\myuser\PycharmProjects\mushu\venv\lib\site-packages\numpy\lib\function_base.py", line 498, in asarray_chkfinite
  raise ValueError(
ValueError: array must not contain infs or NaNs

Process finished with exit code 1

I already checked:

  • There is no NaN, infinite or negative value anywhere in my data
  • I had scikit-learn v0.22.2.post1, I updated to 0.23.1, no difference
  • If I use PCA instead of IncrementalPCA leaving everything else the same, everything works fine, no warnings, no errors, all good
  • Tried using both data = pd.read_csv(fp, dtype = 'Int64') and data = pd.read_csv(fp, dtype = np.float64) with no difference in results.
  • There were similar issues in previous versions, but they refer to versions around 0.16/0.17, most were with more complex code and afaik all were fixed around those versions
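
For readers who want to repeat the first check in the list, here is a minimal sketch of one way to count NaN, infinite and negative entries. The helper name `count_suspect_values` and the tiny stand-in frame are hypothetical; the original post does not show how the check was done.

```python
import numpy as np
import pandas as pd


def count_suspect_values(frame):
    """Count NaN, infinite, and negative entries in a numeric DataFrame.

    Hypothetical helper for illustration; not part of pandas or scikit-learn.
    """
    values = frame.to_numpy(dtype=np.float64)
    return {
        "nan": int(np.isnan(values).sum()),
        "inf": int(np.isinf(values).sum()),
        "negative": int((values < 0).sum()),
    }


# Tiny stand-in for the real CSV, which holds only 0s and 1s.
sample = pd.DataFrame({"a": [0, 1, 1], "b": [1, 0, 1]})
print(count_suspect_values(sample))
```

If all three counts are zero for the full dataset, the NaN in the traceback most likely appears during the computation rather than in the input.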

Here is my data, exactly as I feed it to the above code. It is really just 243 columns x 2,000,000 rows of 0s and 1s. https://drive.google.com/file/d/1JBIliADt9TViTk8qjnmIS3RFEO934dY6/view?usp=sharing

Update: It seems the issue is related to the dataset size. If I fit on a smaller portion, everything works fine, up to around 1,800,000 rows. That's where the error starts showing.
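
A back-of-the-envelope check suggests why the threshold might sit near 1.8 million rows. It rests on two assumptions that are not confirmed in this thread. The first is that the product `self.n_samples_seen_ * n_samples` from the traceback is computed in 32-bit integers, which was the default integer type on Windows for NumPy releases of that period. The second is that `fit` uses its documented default batch size of `5 * n_features`. The helper `int32_product` is hypothetical and only mimics 32-bit wrap-around.

```python
import numpy as np

n_features = 243                     # columns in the dataset from the question
batch_size = 5 * n_features          # documented default when batch_size=None
int32_max = np.iinfo(np.int32).max   # largest value a 32-bit signed int can hold

# Largest n_samples_seen_ whose product with a full batch still fits in int32.
max_rows_before_overflow = int32_max // batch_size


def int32_product(n_samples_seen, n_samples):
    """Multiply two counts in 32-bit integer arithmetic (wraps on overflow).

    Hypothetical helper for illustration only.
    """
    seen = np.array([n_samples_seen], dtype=np.int32)
    batch = np.array([n_samples], dtype=np.int32)
    return int((seen * batch)[0])


wrapped = int32_product(1_800_000, batch_size)
print(batch_size, max_rows_before_overflow, wrapped)
```

Under these assumptions the product stops fitting in 32 bits after roughly 1.77 million rows, which is close to the reported threshold. A wrapped, negative product passed to `np.sqrt` would yield NaN, which is consistent with the "invalid value encountered in sqrt" warning and with the NaN that `linalg.svd` later rejects.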

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 17 (7 by maintainers)

Top GitHub Comments

allanbutler commented, Jun 29, 2020 (1 reaction)

Glad the change worked. I’m still working on writing some tests before submitting another push.

johnny-mueller commented, Jun 29, 2020 (1 reaction)

Now that I have uninstalled and reinstalled everything, including the change from the pull request, the error is no longer present. Thanks for the feedback.


