
sklearn.preprocessing.StandardScaler gets NaN variance when partial_fit with sparse data

See original GitHub issue

Describe the bug

When I feed a specific (sparse) dataset to sklearn.preprocessing.StandardScaler.partial_fit in a specific order, I get a variance of NaN, although the data does NOT contain any NaNs and is very small. When I convert the sparse arrays to dense, it works. When I change the order in which I feed the data, it also works.
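For context, partial_fit combines per-batch statistics incrementally; conceptually this is the pairwise update of Chan, Golub & LeVeque, which is mathematically order-independent. The sketch below (illustrative names, not scikit-learn internals) shows that merging two batches in either order gives the same variance, so a NaN that appears only for one feeding order points to an implementation issue rather than the math.

```python
# Minimal sketch of pairwise mean/variance combination (Chan et al.).
# Names and batch data here are illustrative assumptions, not the
# issue's attached dataset or scikit-learn's actual code.

def combine(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Merge (count, mean, sum of squared deviations) of two batches."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

def stats(xs):
    """Direct (count, mean, sum of squared deviations) of one batch."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    return n, mean, m2

a = [0.0, 0.5, 1.0]
b = [0.25, 0.75]

# Merge in both orders; compare against the variance of the pooled data.
n1, mu1, m2_1 = combine(*stats(a), *stats(b))
n2, mu2, m2_2 = combine(*stats(b), *stats(a))
var_direct = stats(a + b)[2] / len(a + b)

print(m2_1 / n1, m2_2 / n2, var_direct)  # all three agree
```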

Steps/Code to Reproduce

Please work with the attached data: sparse_data.tar.gz

import scipy.sparse as sp
from sklearn import preprocessing

s0 = sp.load_npz('0.npz')
s1 = sp.load_npz('1.npz')

# Buggy behavior
ss0 = preprocessing.StandardScaler(with_mean=False)
ss0.partial_fit(s0)
print(ss0.var_)
ss0.partial_fit(s1)
print(ss0.var_)  # => gets NaN

# When using dense arrays, it works
ss1 = preprocessing.StandardScaler(with_mean=False)
ss1.partial_fit(s0.toarray())
print(ss1.var_)
ss1.partial_fit(s1.toarray())
print(ss1.var_)

# When the order of the data is changed, it works
ss2 = preprocessing.StandardScaler(with_mean=False)
ss2.partial_fit(s1)
print(ss2.var_)
ss2.partial_fit(s0)
print(ss2.var_)

EDIT: Fix sample code around ss2
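Until a release containing the fix is available, densifying each batch before partial_fit sidesteps the sparse code path, as the ss1 example above suggests. A minimal sketch with synthetic data (not the issue's attachment; shapes and seed are assumptions):

```python
# Workaround sketch: convert sparse batches to dense before partial_fit.
# Synthetic data; the second batch is deliberately larger than the first.
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
batches = [sp.csr_matrix(rng.random((n, 3))) for n in (5, 20)]

ss = StandardScaler(with_mean=False)
for batch in batches:
    ss.partial_fit(batch.toarray())  # densify to avoid the sparse path

print(ss.var_)  # finite, no NaN
```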

Expected Results

ss0.var_  # => [0.15896542]
ss1.var_  # => [0.15896542]
ss2.var_  # => [0.15896542]

Actual Results

ss0.var_  # => [nan]
ss1.var_  # => [0.15896542]
ss2.var_  # => [0.15896542]

Versions

I confirmed this issue in two different environments.

System:
    python: 3.7.3 (default, Apr 22 2019, 02:40:09)  [Clang 10.0.1 (clang-1001.0.46.4)]
executable: /usr/local/var/pyenv/versions/3.7.3/bin/python3
   machine: Darwin-19.3.0-x86_64-i386-64bit

Python dependencies:
       pip: 20.0.2
setuptools: 40.8.0
   sklearn: 0.22
     numpy: 1.18.0
     scipy: 1.4.1
    Cython: None
    pandas: 0.25.3
matplotlib: 3.1.2
    joblib: 0.14.1

Built with OpenMP: True

System:
    python: 3.7.6 (default, Feb 14 2020, 16:41:52)  [GCC 8.3.1 20190507 (Red Hat 8.3.1-4)]
executable: /home/***/ws/siml/.venv/bin/python3
   machine: Linux-4.18.0-147.5.1.el8_1.x86_64-x86_64-with-centos-8.1.1911-Core

Python dependencies:
       pip: 19.2.3
setuptools: 41.2.0
   sklearn: 0.22.1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 0.25.3
matplotlib: 3.1.3
    joblib: 0.14.1

Built with OpenMP: True

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

2 reactions
glemaitre commented, Feb 17, 2020

OK, this is solved in https://github.com/scikit-learn/scikit-learn/pull/16466. The bug was due to some badly thought-out implicit casting. It was tricky because it would only be triggered when the number of samples in a partial_fit call was greater than the total number of samples seen in the past.
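The trigger condition described above can be sketched as follows: feed a small batch first, then a batch larger than everything seen so far. The data below is synthetic (shapes and seed are assumptions); on releases containing PR #16466 the variance comes out finite, while on affected 0.22.x versions this pattern could yield NaN.

```python
# Illustration of the trigger: second batch larger than total seen so far.
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
small = sp.csr_matrix(rng.random((3, 2)))   # 3 samples seen first...
large = sp.csr_matrix(rng.random((10, 2)))  # ...then a batch of 10 > 3

ss = StandardScaler(with_mean=False)
ss.partial_fit(small)
ss.partial_fit(large)
print(ss.var_)  # finite on fixed versions
```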

0 reactions
yellowshippo commented, Feb 18, 2020

Thank you all very much!!!
