inconsistent treatment of None and np.NaN in SimpleImputer
Doing constant imputation treats only the "missing_values" marker as missing, so a None
by default stays there:
from sklearn.impute import SimpleImputer
import numpy as np
X = np.array([1, 2, np.nan, None]).reshape(-1, 1)
SimpleImputer(strategy='constant', fill_value="asdf").fit_transform(X)
array([[1],
[2],
['asdf'],
[None]], dtype=object)
However, using strategy='mean' coerces the None to NaN, and so both are replaced:
SimpleImputer(strategy='mean').fit_transform(X)
array([[1. ],
[2. ],
[1.5],
[1.5]])
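The difference comes from dtype conversion: numeric strategies like 'mean' require a float array, and NumPy maps None to NaN during that conversion. A minimal demonstration of the coercion:

```python
import numpy as np

# Mixing np.nan and None forces the array to object dtype
X = np.array([1, 2, np.nan, None])
print(X.dtype)  # object

# Converting to float (as numeric strategies require) turns None
# into nan, so both missing markers become indistinguishable
X_float = X.astype(float)
print(X_float)  # [ 1.  2. nan nan]
```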
I don’t think the definition of what’s missing should depend on the strategy. @thomasjpfan argues that the current constant behavior is inconvenient because it means you have to impute both values separately if you want to one-hot-encode.
It seems safer to treat them differently, but I'm not sure there's a use case for that. This came up in #17317. I think this only matters in these two strategies, as other imputers don't allow dtype object arrays.
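Until the behavior is unified, one workaround is to normalize None to np.nan up front so a single imputer catches both markers. This is a sketch, assuming you want both treated as missing during constant imputation on an object array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([1, 2, np.nan, None], dtype=object).reshape(-1, 1)

# Replace Python None with np.nan before imputing.
# `X == None` compares element-wise on object-dtype arrays.
X_norm = np.where(X == None, np.nan, X)  # noqa: E711

# A single constant imputer now fills both original markers
out = SimpleImputer(strategy='constant', fill_value='asdf').fit_transform(X_norm)
print(out)
```

With this normalization, both the np.nan and the None positions come back as 'asdf', matching what the 'mean' strategy implicitly does via its float conversion.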
Issue Analytics
- Created 3 years ago
- Comments: 6 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
mean and median only support numeric arrays, so the data is converted during validation and None becomes np.nan. In that case, passing a list of missing values won't work. Maybe we could issue a warning when the array has to be converted?
On a similar note, setting missing_values=None would raise an error. The pandas case works, since we coerce pd.NA to np.nan. This behavior is currently documented in SimpleImputer.
Given we already do coercion to this extent, I would be okay with coercing None to np.nan for object dtypes.
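The pd.NA-to-np.nan coercion mentioned above can also be seen at the pandas level. A small sketch, assuming a nullable Int64 array, showing how NA materializes as np.nan when converting to a float ndarray (the representation the imputer ends up seeing):

```python
import numpy as np
import pandas as pd

s = pd.array([1, 2, pd.NA], dtype="Int64")

# pandas materializes NA as the requested na_value when converting
# a nullable extension array to a plain NumPy float ndarray
arr = s.to_numpy(dtype="float64", na_value=np.nan)
print(arr)  # [ 1.  2. nan]
```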