
sklearn.preprocessing.StandardScaler gets NaN variance when partial_fit with sparse data

See original GitHub issue

Describe the bug

When I feed a specific (sparse) dataset to sklearn.preprocessing.StandardScaler.partial_fit in a specific order, I get a variance of NaN, although the data does NOT contain any NaNs and is very small. When I convert the sparse arrays to dense, it works. When I change the order in which I feed the data, it also works.
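For context, partial_fit combines per-batch statistics incrementally; conceptually this is the pairwise update of Chan, Golub & LeVeque, which is mathematically order-independent. The sketch below (illustrative names, not scikit-learn internals) shows that merging two batches in either order gives the same variance, so a NaN that appears only for one feeding order points to an implementation issue rather than the math.

```python
# Minimal sketch of pairwise mean/variance combination (Chan et al.).
# Names and batch data here are illustrative assumptions, not the
# issue's attached dataset or scikit-learn's actual code.

def combine(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Merge (count, mean, sum of squared deviations) of two batches."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

def stats(xs):
    """Direct (count, mean, sum of squared deviations) of one batch."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    return n, mean, m2

a = [0.0, 0.5, 1.0]
b = [0.25, 0.75]

# Merge in both orders; compare against the variance of the pooled data.
n1, mu1, m2_1 = combine(*stats(a), *stats(b))
n2, mu2, m2_2 = combine(*stats(b), *stats(a))
var_direct = stats(a + b)[2] / len(a + b)

print(m2_1 / n1, m2_2 / n2, var_direct)  # all three agree
```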

Steps/Code to Reproduce

Please work with the attached data: sparse_data.tar.gz

import scipy.sparse as sp
from sklearn import preprocessing

s0 = sp.load_npz('0.npz')
s1 = sp.load_npz('1.npz')

# Buggy behavior
ss0 = preprocessing.StandardScaler(with_mean=False)
ss0.partial_fit(s0)
print(ss0.var_)
ss0.partial_fit(s1)
print(ss0.var_)  # => gets NaN

# When using dense arrays, it works
ss1 = preprocessing.StandardScaler(with_mean=False)
ss1.partial_fit(s0.toarray())
print(ss1.var_)
ss1.partial_fit(s1.toarray())
print(ss1.var_)

# When the order of the data is changed, it works
ss2 = preprocessing.StandardScaler(with_mean=False)
ss2.partial_fit(s1)
print(ss2.var_)
ss2.partial_fit(s0)
print(ss2.var_)

EDIT: Fix sample code around ss2
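Until a release containing the fix is available, densifying each batch before partial_fit sidesteps the sparse code path, as the ss1 example above suggests. A minimal sketch with synthetic data (not the issue's attachment; shapes and seed are assumptions):

```python
# Workaround sketch: convert sparse batches to dense before partial_fit.
# Synthetic data; the second batch is deliberately larger than the first.
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
batches = [sp.csr_matrix(rng.random((n, 3))) for n in (5, 20)]

ss = StandardScaler(with_mean=False)
for batch in batches:
    ss.partial_fit(batch.toarray())  # densify to avoid the sparse path

print(ss.var_)  # finite, no NaN
```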

Expected Results

ss0.var_  # => [0.15896542]
ss1.var_  # => [0.15896542]
ss2.var_  # => [0.15896542]

Actual Results

ss0.var_  # => [nan]
ss1.var_  # => [0.15896542]
ss2.var_  # => [0.15896542]

Versions

I confirmed this issue in two different environments.

System:
    python: 3.7.3 (default, Apr 22 2019, 02:40:09)  [Clang 10.0.1 (clang-1001.0.46.4)]
executable: /usr/local/var/pyenv/versions/3.7.3/bin/python3
   machine: Darwin-19.3.0-x86_64-i386-64bit

Python dependencies:
       pip: 20.0.2
setuptools: 40.8.0
   sklearn: 0.22
     numpy: 1.18.0
     scipy: 1.4.1
    Cython: None
    pandas: 0.25.3
matplotlib: 3.1.2
    joblib: 0.14.1

Built with OpenMP: True

System:
    python: 3.7.6 (default, Feb 14 2020, 16:41:52)  [GCC 8.3.1 20190507 (Red Hat 8.3.1-4)]
executable: /home/***/ws/siml/.venv/bin/python3
   machine: Linux-4.18.0-147.5.1.el8_1.x86_64-x86_64-with-centos-8.1.1911-Core

Python dependencies:
       pip: 19.2.3
setuptools: 41.2.0
   sklearn: 0.22.1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 0.25.3
matplotlib: 3.1.3
    joblib: 0.14.1

Built with OpenMP: True

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

2 reactions
glemaitre commented, Feb 17, 2020

OK, this is solved in https://github.com/scikit-learn/scikit-learn/pull/16466. The bug was due to some badly thought-out implicit casting. It was tricky because it would only be triggered when the number of samples in a partial_fit call was greater than the total number of samples seen in the past.
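The trigger condition described above can be sketched as follows: feed a small batch first, then a batch larger than everything seen so far. The data below is synthetic (shapes and seed are assumptions); on releases containing PR #16466 the variance comes out finite, while on affected 0.22.x versions this pattern could yield NaN.

```python
# Illustration of the trigger: second batch larger than total seen so far.
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
small = sp.csr_matrix(rng.random((3, 2)))   # 3 samples seen first...
large = sp.csr_matrix(rng.random((10, 2)))  # ...then a batch of 10 > 3

ss = StandardScaler(with_mean=False)
ss.partial_fit(small)
ss.partial_fit(large)
print(ss.var_)  # finite on fixed versions
```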

0 reactions
yellowshippo commented, Feb 18, 2020

Thank you all very much!!!
