inconsistent treatment of None and np.NaN in SimpleImputer


Constant imputation treats only the value given as missing_values as missing, so with the default (np.nan) a None is left in place:

from sklearn.impute import SimpleImputer
import numpy as np
# Mixing numbers with None yields an object-dtype array
X = np.array([1, 2, np.nan, None]).reshape(-1, 1)
SimpleImputer(strategy='constant', fill_value="asdf").fit_transform(X)
array([[1],
       [2],
       ['asdf'],
       [None]], dtype=object)

However, using strategy='mean' coerces the None to NaN, so both are replaced:

SimpleImputer(strategy='mean').fit_transform(X)
array([[1. ],
       [2. ],
       [1.5],
       [1.5]])

I don’t think the definition of what’s missing should depend on the strategy. @thomasjpfan argues that the current constant behavior is inconvenient because it means you have to impute both values separately if you want to one-hot-encode.

It seems safer to treat them differently, but I’m not sure there’s a use case for that. This came up in #17317. I think this only matters in these two, as other imputers don’t allow dtype object arrays.
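Until the behavior is made consistent, one user-side workaround is to normalize None to np.nan before constant imputation, so that both values count as missing. The following is only a sketch: none_to_nan is a hypothetical helper written for this example, not part of scikit-learn.

```python
import numpy as np
from sklearn.impute import SimpleImputer


def none_to_nan(X):
    """Hypothetical helper: return an object-dtype copy of X with None replaced by np.nan."""
    X = np.array(X, dtype=object)  # np.array copies by default, so the input is untouched
    mask = np.array([v is None for v in X.ravel()], dtype=bool).reshape(X.shape)
    X[mask] = np.nan
    return X


X = np.array([1, 2, np.nan, None], dtype=object).reshape(-1, 1)

# After normalization, the default missing_values=np.nan covers both entries.
imputer = SimpleImputer(strategy="constant", fill_value="asdf")
imputed = imputer.fit_transform(none_to_nan(X))
print(imputed)
```

With this preprocessing step the constant strategy fills both the NaN and the former None, which matches what strategy='mean' already does implicitly.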

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

jeremiedbb commented, Jun 18, 2020

mean and median only support numeric arrays, so the input is converted during data validation and None becomes np.nan. In this case passing a list of missing values won’t work. Maybe we could issue a warning when the array has to be converted?
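The coercion described here can be reproduced with NumPy alone. This sketch only mimics the effect of a float conversion on an object array; it is not scikit-learn's actual validation code.

```python
import numpy as np

# Object-dtype column containing both kinds of "missing" value
X_obj = np.array([1, 2, np.nan, None], dtype=object).reshape(-1, 1)

# Converting to float turns None into nan, so the two become indistinguishable
X_num = X_obj.astype(float)
nan_mask = np.isnan(X_num)

# Mean of the non-missing entries, i.e. the value that would fill both slots
col_mean = np.nanmean(X_num, axis=0)
print(nan_mask.ravel(), col_mean)
```

Once the array is numeric, there is no way to tell which entries were originally None, which is why both rows receive 1.5 in the strategy='mean' output.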

thomasjpfan commented, Dec 26, 2020

On a similar note, setting missing_values=None would raise:

from sklearn.impute import SimpleImputer
import numpy as np

X = np.array([1, 2, np.nan, None]).reshape(-1, 1)
imputer = SimpleImputer(strategy='constant', fill_value="asdf",
                        missing_values=None)

# raises
imputer.fit_transform(X)

In the pandas case, since we coerce pd.NA to np.nan the following works:

from sklearn.impute import SimpleImputer
import pandas as pd

# Both pd.NA and None become missing entries in the categorical column
X_pd = pd.DataFrame({'f1': pd.Series(['dog', 'cat', pd.NA, None], dtype='category')})

imputer = SimpleImputer(strategy='constant',
                        fill_value='adsf')
imputer.fit_transform(X_pd)

# array([['dog'],
#       ['cat'],
#       ['adsf'],
#       ['adsf']], dtype=object)

This behavior is currently documented in SimpleImputer.

Given that we already coerce to this extent, I would be okay with coercing None to np.nan for object dtypes.
