sklearn.preprocessing.StandardScaler gets NaN variance when calling partial_fit with sparse data
Describe the bug
When I feed a specific (sparse) dataset to sklearn.preprocessing.StandardScaler.partial_fit in a specific order, I get a variance of NaN, even though the data does NOT contain any NaNs and is very small. When I convert the sparse arrays to dense, it works. When I change the order in which the data is fed, it also works.
Steps/Code to Reproduce
Please work with the attached data: sparse_data.tar.gz
import scipy.sparse as sp
from sklearn import preprocessing
s0 = sp.load_npz('0.npz')
s1 = sp.load_npz('1.npz')
# Buggy behavior
ss0 = preprocessing.StandardScaler(with_mean=False)
ss0.partial_fit(s0)
print(ss0.var_)
ss0.partial_fit(s1)
print(ss0.var_) # => gets NaN
# When using dense arrays, it works
ss1 = preprocessing.StandardScaler(with_mean=False)
ss1.partial_fit(s0.toarray())
print(ss1.var_)
ss1.partial_fit(s1.toarray())
print(ss1.var_)
# When changing the order of the data, it works
ss2 = preprocessing.StandardScaler(with_mean=False)
ss2.partial_fit(s1)
print(ss2.var_)
ss2.partial_fit(s0)
print(ss2.var_)
EDIT: Fixed the sample code around ss2.
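If the attached .npz files are not available, the same pattern can likely be reproduced with synthetic sparse data. The following is a minimal sketch, not taken from the issue: it uses random CSR matrices and follows the trigger condition described in the maintainer comment at the bottom, i.e. the second batch contains more samples than all samples seen so far.
import numpy as np
import scipy.sparse as sp
from sklearn import preprocessing
rng = np.random.RandomState(0)
# First batch: 3 samples; second batch: 10 samples (more than seen so far)
a0 = sp.csr_matrix(rng.rand(3, 1))
a1 = sp.csr_matrix(rng.rand(10, 1))
ss = preprocessing.StandardScaler(with_mean=False)
ss.partial_fit(a0)
ss.partial_fit(a1)
print(ss.var_)  # expected: a finite variance; on the affected versions reported below: [nan]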
Expected Results
ss0.var_ # => [0.15896542]
ss1.var_ # => [0.15896542]
ss2.var_ # => [0.15896542]
Actual Results
ss0.var_ # => [nan]
ss1.var_ # => [0.15896542]
ss2.var_ # => [0.15896542]
Versions
I confirmed this issue in two different environments.
System:
python: 3.7.3 (default, Apr 22 2019, 02:40:09) [Clang 10.0.1 (clang-1001.0.46.4)]
executable: /usr/local/var/pyenv/versions/3.7.3/bin/python3
machine: Darwin-19.3.0-x86_64-i386-64bit
Python dependencies:
pip: 20.0.2
setuptools: 40.8.0
sklearn: 0.22
numpy: 1.18.0
scipy: 1.4.1
Cython: None
pandas: 0.25.3
matplotlib: 3.1.2
joblib: 0.14.1
Built with OpenMP: True
System:
python: 3.7.6 (default, Feb 14 2020, 16:41:52) [GCC 8.3.1 20190507 (Red Hat 8.3.1-4)]
executable: /home/***/ws/siml/.venv/bin/python3
machine: Linux-4.18.0-147.5.1.el8_1.x86_64-x86_64-with-centos-8.1.1911-Core
Python dependencies:
pip: 19.2.3
setuptools: 41.2.0
sklearn: 0.22.1
numpy: 1.18.1
scipy: 1.4.1
Cython: None
pandas: 0.25.3
matplotlib: 3.1.3
joblib: 0.14.1
Built with OpenMP: True
Top GitHub Comments
OK, this is solved in https://github.com/scikit-learn/scikit-learn/pull/16466. The bug was due to some badly thought-out implicit casting. It was tricky because the bug would only be triggered when the number of samples in a partial_fit call was greater than the total number of samples seen in the past. Thank you all very much!!!
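For anyone stuck on an affected version, here is a hedged workaround sketch based only on the reporter's observations above (it assumes s0, s1 and the imports from the reproduction snippet are in scope): either densify each batch before partial_fit, or feed the batches largest-first so that no single call brings in more samples than have already been seen.
# Option 1: densify each sparse batch before partial_fit (as with ss1 above)
ss_dense = preprocessing.StandardScaler(with_mean=False)
for batch in (s0, s1):
    ss_dense.partial_fit(batch.toarray())
print(ss_dense.var_)
# Option 2: feed batches in descending size order so no call adds more samples
# than the running total, avoiding the trigger condition described above (as with ss2)
ss_sorted = preprocessing.StandardScaler(with_mean=False)
for batch in sorted((s0, s1), key=lambda m: m.shape[0], reverse=True):
    ss_sorted.partial_fit(batch)
print(ss_sorted.var_)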