
Mixed-type imputation for IterativeImputer


I am opening this issue to more clearly document scattered comments I have made about this idea in several issues.

Workflow

For even the simplest dataset (say the Titanic dataset), dealing with missing values and encoding data are essential pre-processing steps. Scikit-learn has a lot of great tools for this, but I am going to focus on IterativeImputer for missing data and ColumnTransformer (and by extension, any other transformers that can be fit inside of it) for encoding.

Please correct me if I am mistaken, but I believe there is currently no easy way to use these two tools together. Transformers do not work with missing values, and IterativeImputer only works with continuous numerical data. Of course, one could chain SimpleImputer -> ColumnTransformer, but then you cannot take advantage of more advanced imputation strategies (a sketch of that workaround follows below).
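
For reference, here is a minimal sketch of that workaround with today’s API, assuming illustrative column names: each ColumnTransformer branch chains SimpleImputer with an encoder, so imputation is per-column and model-free.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative Titanic-like data with missing values in both column types.
X = pd.DataFrame({
    "age": [16.0, 56.0, np.nan],
    "sex": ["M", np.nan, "F"],
})

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), ["age"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder()),
    ]), ["sex"]),
])

# Works, but each column is filled independently of the others.
X_enc = preprocess.fit_transform(X)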

I would like to be able to feed IterativeImputer a ColumnTransformer object (possibly already fit) along with a list of estimators (essentially telling it which estimator to use for each column and how to transform the data for those estimators) and have it give me back my data, in its original format, with imputed values. Then, of course, you can manually pass it through that same ColumnTransformer again and move it down the pipeline.

Proposed solution

Changes to IterativeImputer:

  1. A new parameter called transformer that defaults to None.
  2. Making the estimator parameter accept an iterable in addition to the single estimator it currently supports.
  3. Introduce a new step where ColumnTransformer gets applied. This would be between the initial imputation step (using SimpleImputer) and the estimator steps.
  4. Some internal changes to avoid errors from attempting numerical operations on object-dtype data. I tested implementing these fixes; this part is trivial.

I think these changes should be backwards compatible.

In terms of ColumnTransformer, the two features needed for this to work are:

  1. inverse_transform: work on this seems to be underway in #11639, but it appears to have stalled (see the sketch after this list for why a column-wise inverse is the key missing piece).
  2. Ability to select which columns are present when calling transform: a similar issue was brought up in #15781. I think this could even be achieved with the private method _calculate_inverse_indices proposed in #11639 if it were made public.
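
To illustrate item 1: each fitted transformer can already undo its own encoding; what’s missing is composing those per-block inverses at the ColumnTransformer level. A minimal sketch with the real OneHotEncoder API (the round-trip shown is what ColumnTransformer.inverse_transform would have to perform block by block):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

sex = np.array([["M"], ["F"], ["F"]])

enc = OneHotEncoder()
onehot = enc.fit_transform(sex)  # sparse one-hot matrix, categories ['F', 'M']

# Each transformer can map its encoded block back to the original column...
assert (enc.inverse_transform(onehot) == sex).all()

# ...but there is no equivalent round-trip on a fitted ColumnTransformer
# that maps a transformed feature matrix back to the original columns (#11639).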

Problems with this approach

I can think of two main problems:

  1. Convergence: there is already concern about the convergence of IterativeImputer, and I am fairly certain that introducing classifiers would make it worse. A simple fix would be to not support tolerance-based early termination when one or more of the estimators is a classifier; it is easy to check whether any estimator is a classifier (see the sketch after this list). Several convergence-related parameters (e.g. tol) would then need to raise errors if they are not set to None (and maybe the default should be changed?).
  2. Initial imputation: we would have to restrict the initial_strategy parameter to 'constant' or 'most_frequent' when classification tasks are present.
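
For reference, here is a minimal sketch of what that validation could look like, assuming the list-of-estimators and initial_strategy parameters proposed above (the helper name _validate_mixed_params is hypothetical):

from sklearn.base import is_classifier

def _validate_mixed_params(estimators, initial_strategy, tol):
    """Hypothetical check that IterativeImputer.fit could run up front."""
    if any(is_classifier(est) for est in estimators):
        # Problem 1: no tolerance-based early stopping with classifiers.
        if tol is not None:
            raise ValueError("tol must be None when any estimator is a classifier")
        # Problem 2: the initial fill must also be valid for categorical columns.
        if initial_strategy not in ("constant", "most_frequent"):
            raise ValueError(
                "initial_strategy must be 'constant' or 'most_frequent' "
                "when any estimator is a classifier"
            )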

Existing work

I tried implementing this proposal and, in some sense, got pretty far: I resolved the errors from applying numerical numpy operations to object-dtype data, and I got a list of estimators to work. Where I ran into trouble was the ColumnTransformer limitations mentioned above.

Example of desired usage

Given a dataset that looks something like this (this is X only):

Age   Sex   Cabin
16    M     NaN
56    NaN   C19
NaN   F     XYZ

We have missing continuous numerical data (Age), two-level categorical data (Sex), and multi-level categorical data (Cabin), and we want to impute the missing values. The idea would be something like the following:

from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X = ...  # the data described above

# One estimator per column, matched to that column's type.
estimators = [
    BayesianRidge(),           # for Age (continuous)
    LogisticRegression(),      # for Sex (two-level categorical)
    DecisionTreeClassifier(),  # for Cabin (multi-level categorical)
]

# How to encode the columns before they are fed to the estimators.
transformer = ColumnTransformer(transformers=[
    ('num', StandardScaler(), [0]),
    ('cat', OneHotEncoder(), [1, 2]),
])

# The transformer parameter is the proposed (not yet existing) addition.
imputer = IterativeImputer(
    estimator=estimators,
    transformer=transformer,
)

imputer.fit_transform(X)

And we’d get back something like:

Age   Sex   Cabin
16    M     F32
56    F     C19
55    F     XYZ


Top GitHub Comments

AuSpotter commented on Feb 13, 2021 (2 reactions)

OK, so in that case it likely makes sense (in my use case) to impute the missing numerical data (regression) and then separately use any supervised prediction approach for the categorical data, as sketched below.

I’m working with geological data: geochemical analyses (numerical) and rock type (categorical). This mix of missing data is extremely common in my discipline, but I’m not aware of anyone working on round-robin methods for imputing mixed data.
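
For concreteness, here is a minimal sketch of that two-stage approach using the existing scikit-learn API (the column names and the choice of RandomForestClassifier are illustrative):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Illustrative frame: numeric assays plus a categorical rock type.
X = pd.DataFrame({
    "sio2": [62.1, np.nan, 48.3, 71.0],
    "mgo": [2.4, 8.9, np.nan, 0.7],
    "rock": ["felsic", "mafic", np.nan, "felsic"],
})

# Stage 1: round-robin regression imputation on the numeric columns only.
num_cols = ["sio2", "mgo"]
X[num_cols] = IterativeImputer(random_state=0).fit_transform(X[num_cols])

# Stage 2: treat the categorical column as a supervised target, training on
# rows where it is observed and predicting the rows where it is missing.
known = X["rock"].notna()
clf = RandomForestClassifier(random_state=0)
clf.fit(X.loc[known, num_cols], X.loc[known, "rock"])
X.loc[~known, "rock"] = clf.predict(X.loc[~known, num_cols])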

adriangb commented on May 24, 2020 (1 reaction)

Okay so I have an initial working prototype! It passes all of the existing tests, as well as this super simple test for the new functionality:

import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer


age = [np.nan, 82.0, 28.0]
sex = ["male", "female", np.nan]
cabin = ["c1", np.nan, "e8"]

X = pd.DataFrame({"age": age, "sex": sex, "cabin": cabin})


# Prototype-branch API: (estimator, columns) and (transformer, columns) pairs.
imp = IterativeImputer(
    estimator=[(RandomForestClassifier(), slice(1, 3))],
    transformers=[(OneHotEncoder(sparse=False), slice(1, 3))],
    initial_strategy="most_frequent",
)

X_filled = imp.fit_transform(X)

print(X_filled)
# [[55.0 'male' 'c1']
#  [82.0 'female' 'c1']
#  [28.0 'male' 'e8']]

The branch is here if anyone wants to take a look, but it is very hacky for now. A couple of notes on what I found:

  1. I decided to go with a separate transformers parameter because, as @jnothman mentioned above, it is hard to split up the pipeline. This way we don’t have to deal with that, and I think it is easier for users. I chose to drop the name part of the column specification (for ColumnTransformer it is ('name', transf_obj, columns); here it is just (transf_obj, columns)) because I don’t think we have a use for the name, but I guess we could keep it for symmetry.
  2. I am currently applying and reversing the transformation each time _impute_one_feature is called. This minimizes the changes to the indexing, but it would be more efficient to apply the transformation once to the entire input, iterate with adjusted indexing (i.e. adjusting feat_idx, etc.), and then reverse the transformation once iteration is done. That, however, would require extensive refactoring.
  3. I actually copied almost nothing from ColumnTransformer because what we end up doing with the transformers here (make a copy for each column and fit independently) is very different.
  4. I had to add ~4 if statements to skip convergence and type checks when using non-numeric values. Ideally this would be refactored into at most 1-2 checks.
  5. If no estimator is specified for a given column, the default BayesianRidge is used.
