SimpleImputer performance on strings
Describe the bug
SimpleImputer is extremely slow on string/object columns when using strategy="most_frequent".
Steps/Code to Reproduce
import sys; print('python', sys.version)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='most_frequent')
n = 1_000_000
df = pd.DataFrame(
    dict(a_string_column=np.random.choice(['a', 'b'], n),
         a_numeric_column=np.random.choice([1, 2], n))
)
print('np.unique:')
%time np.unique(df.a_string_column, return_counts=True)
print('\nsimple imputer numeric:')
%time si.fit(df[['a_numeric_column']])
print('\nsimple imputer string/object:')
%time si.fit(df[['a_string_column']])
Output
np.unique:
Wall time: 766 ms
simple imputer numeric:
Wall time: 43 ms
simple imputer string/object:
Wall time: 29.7 s
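For readers who need something faster in the meantime, the timings above suggest computing the most frequent value outside of SimpleImputer. The following is only a minimal sketch of that idea using pandas; it is not the fix adopted by scikit-learn, and the helper name most_frequent_fill_values is hypothetical. It assumes every column has at least one non-missing value.

```python
import numpy as np
import pandas as pd


def most_frequent_fill_values(df):
    """Hypothetical helper: most frequent non-missing value per column."""
    # Series.mode() ignores missing values by default and returns the modes
    # sorted, so .iloc[0] picks the smallest value when there is a tie.
    # Assumption: each column has at least one non-missing value; an
    # all-missing column would raise IndexError here.
    return {col: df[col].mode().iloc[0] for col in df.columns}


rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame(
    {"a_string_column": pd.Series(rng.choice(["a", "b"], n), dtype=object)}
)
df.loc[::10, "a_string_column"] = np.nan  # inject some missing values

fill_values = most_frequent_fill_values(df)
imputed = df.fillna(fill_values)  # dict keys are matched against column names
```

This only covers the single case shown in the report (one fill value per column, NaN as the missing marker) and skips the input validation that SimpleImputer performs, so it is a stopgap rather than a drop-in replacement.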
Versions
System:
python: 3.7.9
machine: Windows-10
Python dependencies:
pip: 20.2.4
setuptools: 50.3.0.post20201006
sklearn: 0.23.2
numpy: 1.19.4
scipy: 1.4.1
Cython: None
pandas: 1.1.3
matplotlib: 3.3.1
joblib: 0.17.0
threadpoolctl: 2.1.0
Issue Analytics
- State:
- Created 3 years ago
- Comments:12 (11 by maintainers)
Top GitHub Comments
@DavidKatz-il feel free to comment below with "take" to be assigned to this PR.
take