inconsistent treatment of None and np.NaN in SimpleImputer

See original GitHub issue

Constant imputation treats only the value given by missing_values (np.nan by default) as missing, so by default a None stays in place:

from sklearn.impute import SimpleImputer
import numpy as np
X = np.array([1, 2, np.nan, None]).reshape(-1, 1)
SimpleImputer(strategy='constant', fill_value="asdf").fit_transform(X)

array([[1],
       [2],
       ['asdf'],
       [None]], dtype=object)

However, using strategy='mean' coerces the None to NaN, so both are replaced:

SimpleImputer(strategy='mean').fit_transform(X)

array([[1. ],
       [2. ],
       [1.5],
       [1.5]])

I don’t think the definition of what’s missing should depend on the strategy. @thomasjpfan argues that the current constant behavior is inconvenient because it means you have to impute both values separately if you want to one-hot-encode.

It seems safer to treat them differently, but I’m not sure there’s a use case for that. This came up in #17317. I think this only matters in these two, as other imputers don’t allow dtype object arrays.
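One way to sidestep the difference, sketched here under the assumption that None and NaN should both count as missing, is to coerce None to np.nan before constant imputation. Using pd.isna to build the mask, and the names X_coerced and imputed, are illustrative choices and not part of scikit-learn's API:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Object array holding both kinds of "missing" marker
X = np.array([1, 2, np.nan, None], dtype=object).reshape(-1, 1)

# pd.isna flags both None and NaN in an object array, so one
# assignment turns every missing marker into np.nan
X_coerced = X.copy()
X_coerced[pd.isna(X_coerced)] = np.nan

# With the default missing_values=np.nan, a single imputer is now enough
imputer = SimpleImputer(strategy="constant", fill_value="asdf")
imputed = imputer.fit_transform(X_coerced)
print(imputed)
```

After the coercion, one imputation pass fills both rows, so the values do not have to be imputed separately before one-hot encoding.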

Issue Analytics

Created 3 years ago
Comments: 6 (6 by maintainers)

Top GitHub Comments

jeremiedbb commented, Jun 18, 2020

mean and median only support numeric arrays, so the array is converted during data validation and None becomes np.nan. In this case, passing a list of missing values won’t work. Maybe we could issue a warning when the array has to be converted?
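The conversion described above can be approximated with plain NumPy: casting an object array to float64 turns None into nan. The helper to_float_with_warning below is hypothetical and only illustrates what the suggested warning might look like; it is not scikit-learn's validation code:

```python
import warnings

import numpy as np


def to_float_with_warning(X):
    """Hypothetical helper: cast to float64, warning if None became NaN."""
    X = np.asarray(X, dtype=object)
    had_none = any(v is None for v in X.ravel())
    X_float = X.astype(np.float64)  # None is converted to nan by this cast
    if had_none:
        warnings.warn("None values were converted to np.nan", UserWarning)
    return X_float


X = np.array([1, 2, np.nan, None], dtype=object).reshape(-1, 1)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    X_float = to_float_with_warning(X)

# Both missing entries are now nan, so the mean is taken over 1 and 2 only
column_mean = np.nanmean(X_float)
print(X_float.ravel(), column_mean, len(caught))
```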

thomasjpfan commented, Dec 26, 2020

On a similar note, setting missing_values=None would raise:

from sklearn.impute import SimpleImputer
import numpy as np

X = np.array([1, 2, np.nan, None]).reshape(-1, 1)
imputer = SimpleImputer(strategy='constant', fill_value="asdf",
                        missing_values=None)

# raises
imputer.fit_transform(X)

In the pandas case, since we coerce pd.NA to np.nan, the following works:

from sklearn.impute import SimpleImputer
import pandas as pd

X_pd = pd.DataFrame({'f1': pd.Series(['dog', 'cat', pd.NA, None], dtype='category')})

imputer = SimpleImputer(strategy='constant',
                        fill_value='adsf')
imputer.fit_transform(X_pd)

# array([['dog'],
#       ['cat'],
#       ['adsf'],
#       ['adsf']], dtype=object)

This behavior is currently documented in SimpleImputer.

Given that we already coerce to this extent, I would be okay with coercing None to np.nan for object dtypes.
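As a small illustration of why the pandas example treats both entries alike, the sketch below inspects the categorical column directly. It shows only pandas behavior, not the coercion step inside scikit-learn, and the variable names are illustrative:

```python
import pandas as pd

s = pd.Series(['dog', 'cat', pd.NA, None], dtype='category')

# Neither pd.NA nor None becomes a category; both are stored as "missing"
missing_mask = s.isna().tolist()
categories = list(s.cat.categories)
codes = s.cat.codes.tolist()  # -1 is the code pandas uses for missing entries

print(missing_mask)
print(categories)
print(codes)
```

Because both markers collapse to the same missing representation in the categorical column, the imputer cannot tell them apart in this case.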
