SimpleImputer performance on strings

Describe the bug

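SimpleImputer is extremely slow on string/object columns when fitted with strategy “most_frequent”. In the example below, fitting on one million strings takes about 30 seconds, compared with 43 ms for a numeric column of the same length.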

Steps/Code to Reproduce

# Note: %time is an IPython/Jupyter magic, so run this in a notebook or IPython shell.
import sys; print('python', sys.version)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

si = SimpleImputer(strategy='most_frequent')

n = 1_000_000
df = pd.DataFrame(
    dict(a_string_column = np.random.choice(['a', 'b'], n),
         a_numeric_column = np.random.choice([1, 2], n))
)

print('np.unique:')
%time np.unique(df.a_string_column, return_counts=True)

print('\nsimple imputer numeric:')
%time si.fit(df[['a_numeric_column']])

print('\nsimple imputer string/object:')
%time si.fit(df[['a_string_column']])

Output

np.unique:
Wall time: 766 ms

simple imputer numeric:
Wall time: 43 ms

simple imputer string/object:
Wall time: 29.7 s
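
Since np.unique handles the same string column in well under a second in the timings above, one possible workaround until the imputer itself is faster is to compute the most frequent value per column with pandas and fill the gaps directly. The following is only a minimal sketch under that assumption, not the fix adopted in scikit-learn: the helper name most_frequent_fill_values is hypothetical, the missing entries are injected artificially, and a column with no non-missing values at all would raise an IndexError here.

```python
import numpy as np
import pandas as pd


def most_frequent_fill_values(df: pd.DataFrame) -> pd.Series:
    """Return one fill value per column: the first mode of its non-missing values.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    """
    return df.apply(lambda col: col.mode(dropna=True).iloc[0])


rng = np.random.default_rng(0)
n = 10_000  # smaller than the report's 1_000_000 to keep the sketch quick

# Skewed on purpose so that 'a' is clearly the most frequent value.
df = pd.DataFrame(
    {"a_string_column": rng.choice(["a", "b"], n, p=[0.7, 0.3])}
).astype(object)

# Assumption for the demo: knock out every 100th entry to have something to impute.
df.iloc[::100, 0] = np.nan

fill_values = most_frequent_fill_values(df)
imputed = df.fillna(fill_values)  # a Series fills each column by its label

print(fill_values)
print("missing before:", int(df.isna().sum().sum()))
print("missing after:", int(imputed.isna().sum().sum()))
```

This keeps everything in pandas, so it does not plug into a scikit-learn Pipeline the way SimpleImputer does. Inside a pipeline, the computed values could instead be passed to a separate SimpleImputer(strategy='constant', fill_value=...) for each column.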

Versions

System:
    python: 3.7.9
   machine: Windows-10

Python dependencies:
          pip: 20.2.4
   setuptools: 50.3.0.post20201006
      sklearn: 0.23.2
        numpy: 1.19.4
        scipy: 1.4.1
       Cython: None
       pandas: 1.1.3
   matplotlib: 3.3.1
       joblib: 0.17.0
threadpoolctl: 2.1.0

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 12 (11 by maintainers)

Top GitHub Comments

NicolasHug commented, Dec 9, 2020 (2 reactions)

@DavidKatz-il feel free to comment below with “take” to be assigned to this PR.

DavidKatz-il commented, Dec 9, 2020 (1 reaction)

take


