Programmatically pass categorical_features to HGBT
Describe the workflow you want to enable
#18394 added native support for categorical features to HGBT. To use it, you have to ordinal encode your categoricals, e.g. in a ColumnTransformer (potentially part of a pipeline), and indicate the column positions of the passed X via the parameter categorical_features.

How can we then programmatically, i.e. without manually filling in categorical_features, specify the positions of the categorical (ordinal encoded) columns in the feature matrix X that is finally passed to HGBT?
```python
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

X, y = ...
ct = make_column_transformer(
    (OrdinalEncoder(), make_column_selector(dtype_include="category")),
    remainder="passthrough",
)
hist_native = make_pipeline(
    ct,
    HistGradientBoostingRegressor(categorical_features=???),
)
```

How to fill `???`?
Possible solutions
- Set it manually, e.g. use OrdinalEncoder as first or last part of a ColumnTransformer. This is currently used in this example, but it's not ideal.
- Pass a callable/function, e.g. HistGradientBoostingRegressor(categorical_features=my_function), see https://github.com/scikit-learn/scikit-learn/pull/18394#issuecomment-731568451 for details. Sadly, this doesn't work: it breaks when the pipeline is used in e.g. cross_val_score, because the estimators are cloned there, and thus the callable refers to an unfitted ColumnTransformer.
- Pass feature names once they are available. Even then, you have to know the exact feature names that are created by OrdinalEncoder.
- Pass feature-aligned metadata ("this is a categorical feature"), similar to SLEP006 and proposed in #4196.
- Internally use an OrdinalEncoder within the GBDT estimator so that users don't need to create a pipeline.
Further context
One day, this might become relevant for more estimators; for linear models, see #18893.
Issue Analytics
- Created: 3 years ago
- Reactions: 3
- Comments: 14 (14 by maintainers)
(I edited the post to add option 5)
Ideally we would have option 4 eventually, although I don't see it happening too soon.
Option 5 arguably makes life easiest for users, as they don't even need to use a pipeline anymore. It also adds complexity to the estimator, because we basically need to mimic the pipeline logic (e.g. call self._encoder.transform() in predict(), etc.). Despite the potential complexity, I feel it might be worth giving it a try at least as a prototype: it's probably the most contained solution, so a PR is more likely to be acceptable in the near future than the other ones.

Throwing another idea out there: