Mixed-type imputation for IterativeImputer
I am opening this issue to more clearly document scattered comments I have made about this idea in several issues.
Workflow
For even the simplest dataset (say the Titanic dataset), dealing with missing values and encoding data are essential pre-processing steps. Scikit-learn has a lot of great tools for this, but I am going to focus on `IterativeImputer` for missing data and `ColumnTransformer` (and by extension, any other transformers that can be fit inside of it) for encoding.
Please correct me if I am mistaken, but I believe there is currently no easy way to use these two tools together. Transformers do not work with missing values, and `IterativeImputer` only works with continuous numerical data. Of course one could use `SimpleImputer` -> `ColumnTransformer`, but then you cannot take advantage of more advanced imputation strategies.
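For concreteness, a minimal sketch of that workaround (the column indices are assumptions matching the Titanic-style example later in this issue):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# SimpleImputer first, then encoding: this works today, but only
# simple per-column strategies (mean, most_frequent, ...) are available.
preprocess = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', ColumnTransformer([
        ('num', StandardScaler(), [0]),
        ('cat', OneHotEncoder(), [1, 2]),
    ])),
])
```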
I would like to have the ability to feed `IterativeImputer` a `ColumnTransformer` object (which may even already be fit) along with a list of estimators (so basically, I am specifying which estimator to use for each column and how to transform the data for those estimators) and have it give me back my data, in its original format, with imputed values. Then of course you can manually pass it through that same `ColumnTransformer` again and move it down the pipeline.
Proposed solution
Changes to `IterativeImputer`:
- A new parameter called `transformer` that defaults to `None`.
- Making the `estimator` parameter accept an iterable in addition to the single estimator it currently supports.
- Introducing a new step where the `ColumnTransformer` gets applied. This would be between the initial imputation step (using `SimpleImputer`) and the estimator steps.
- Some internal changes to avoid errors from trying to do numerical operations on `object` dtype data. I tested implementing these fixes; this part is trivial. (A sketch of this point follows below.)
I think these changes should be backwards compatible.
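As an illustration of the dtype point in the last bullet (a sketch, not the actual patch): `np.isnan` raises a `TypeError` on `object` dtype arrays, so the missing-value mask has to be computed in a dtype-agnostic way, for example via pandas:

```python
import numpy as np
import pandas as pd

X = np.array([[16.0, 'M'], [56.0, None], [np.nan, 'F']], dtype=object)

# np.isnan(X) would raise TypeError on the object dtype
mask = pd.isna(X)  # elementwise mask that handles NaN and None alike
```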
In terms of `ColumnTransformer`, the two features needed for this to work (both sketched below) are:

- `inverse_transform`: work on this seems to be underway in #11639, but it seems a bit stalled.
- The ability to select which columns are present when using `transform`: a similar issue was brought up in #15781. I think this could even be achieved with the private method `_calculate_inverse_indices` proposed in #11639 if it is made public.
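For illustration, a sketch of how those two features might look. Both calls are hypothetical: `inverse_transform` follows the proposal in #11639, and the column-subset argument is an invented stand-in for what #15781 asks for.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

X = np.array([[16.0, 'M', 'A1'], [56.0, 'F', 'C19']], dtype=object)

ct = ColumnTransformer([
    ('num', StandardScaler(), [0]),
    ('cat', OneHotEncoder(), [1, 2]),
]).fit(X)

Xt = ct.transform(X)
X_back = ct.inverse_transform(Xt)         # proposed in #11639, not yet merged
Xt_sub = ct.transform(X, columns=[0, 1])  # hypothetical syntax for #15781
```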
Problems with this approach
I can think of two main problems:

- Convergence: there is already concern regarding the convergence of `IterativeImputer`, and I am fairly certain that introducing classifiers would make it worse. A simple fix would be to not support tolerance/convergence-based early termination if one or more classifiers are used as estimators; it is easy to check whether any of the estimators are classifiers (see the sketch after this list). Several parameters related to convergence (e.g. `tol`) would need to raise errors if they are not set to `None` (and maybe the default should be changed?).
- Initial imputation: we would have to restrict the `initial_strategy` parameter to `constant` and `most_frequent` when classification tasks are present.
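The classifier check itself is straightforward with the existing `sklearn.base.is_classifier` helper; a minimal sketch:

```python
from sklearn.base import is_classifier
from sklearn.linear_model import BayesianRidge, LogisticRegression

estimators = [BayesianRidge(), LogisticRegression()]
tol = 1e-3  # user-supplied convergence tolerance

# Convergence-based early stopping only makes sense for regressors,
# so reject a numeric tol whenever any estimator is a classifier.
if any(is_classifier(est) for est in estimators) and tol is not None:
    raise ValueError("tol must be None when any estimator is a classifier")
```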
Existing work
I tried implementing this proposal. In some sense I got pretty far: I resolved the errors from trying to apply numerical numpy operations to `object` dtype data, and I got a list of estimators to work. Where I ran into issues was the `ColumnTransformer` problems mentioned above.
Example of desired usage
Given a dataset that looks something like this (this is `X` only):

| Age | Sex | Cabin |
|-----|-----|-------|
| 16  | M   | NaN   |
| 56  | NaN | C19   |
| NaN | F   | XYZ   |
We have missing continuous numerical data (Age), two-level categorical data (Sex) and multi-level categorical data (Cabin). We want to impute these missing values. The idea would be something as follows:
```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder

X = ....  # the data described above

estimators = [
    BayesianRidge(),           # for Age
    LogisticRegression(),      # for Sex
    DecisionTreeClassifier(),  # for Cabin
]

transformer = ColumnTransformer(transformers=[
    ('num', StandardScaler(), [0]),
    ('cat', OneHotEncoder(), [1, 2]),
])

# `transformer` and the list form of `estimator` are the proposed additions
imputer = IterativeImputer(
    estimator=estimators,
    transformer=transformer,
)

imputer.fit_transform(X)
```
And we’d get back something like:

| Age | Sex | Cabin |
|-----|-----|-------|
| 16  | M   | F32   |
| 56  | F   | C19   |
| 55  | F   | XYZ   |
Top GitHub Comments
OK, so in that case it likely makes sense (in my use case) to impute missing data (regression) and then separately use any supervised prediction approach for the categorical data.
I’m working with geological data: geochemical analyses (numerical) and rock type (categorical). This mix of missing data is extremely common in my discipline, but I’m not aware of anyone working to use round-robin methods to impute mixed data.
Okay, so I have an initial working prototype! It passes all of the existing tests, as well as a super simple test for the new functionality.
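Something along these lines (a hypothetical reconstruction using the prototype's `transformers` parameter described in the notes below, not the branch's actual test):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X = np.array([[16.0, 'M'], [56.0, None], [np.nan, 'F']], dtype=object)

imputer = IterativeImputer(
    estimator=[BayesianRidge(), LogisticRegression()],
    transformers=[(OneHotEncoder(), [1])],  # branch-only parameter
    initial_strategy='most_frequent',
)
X_imputed = imputer.fit_transform(X)

assert X_imputed.shape == X.shape
# no missing values should remain, in either the numeric or the string column
assert not np.any([v is None or v != v for v in X_imputed.ravel()])
```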
The branch is here if anyone wants to take a look, but it is very hacky for now. A couple of notes on what I found:

- I went with a `transformers` parameter because, as @jnothman mentioned above, it is hard to split up the pipeline. This way we don't have to deal with that, and I think it is easier for users. I chose to drop the `name` part of the column specification (for `ColumnTransformer` it is `('name', transf_obj, columns)`, but here just `(transf_obj, columns)`) because I don't think we have a use for the `name` parameter, but I guess we could keep it for symmetry.
- The transformation is applied each time `_impute_one_feature` is called. This minimizes the changes to the indexing, but it would be more efficient to apply the transformation once to the entire input, iterate with adjusted indexing (i.e. `feat_idx` etc. would need to be adjusted), and then reverse the transformation once iteration is done. That, however, would require extensive refactoring. (A sketch of this per-call pattern follows after these notes.)
- I did not reuse `ColumnTransformer`, because what we end up doing with the transformers here (making a copy for each column and fitting it independently) is very different.
- There are a bunch of `if` statements to skip the convergence and type checks when using non-numeric values. Ideally this would be refactored into 1-2 checks max.
- By default, `BayesianRidge` is used.
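To make the per-call pattern from the second note concrete, a rough sketch of one round-robin step (the function name and signature are mine, not the branch's):

```python
def impute_one_feature(X, feat_idx, estimator, transformer, missing_mask):
    # Use every other column as predictors for the target column.
    other_cols = [j for j in range(X.shape[1]) if j != feat_idx]

    # The transformation is (re-)applied on every call, as noted above;
    # transforming once per round would be faster but needs index bookkeeping.
    Xt = transformer.fit_transform(X[:, other_cols])

    observed = ~missing_mask[:, feat_idx]
    estimator.fit(Xt[observed], X[observed, feat_idx])

    # Write predictions back into the original (untransformed) matrix.
    X[~observed, feat_idx] = estimator.predict(Xt[~observed])
    return X
```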