Programmatically pass categorical_features to HGBT
Describe the workflow you want to enable
#18394 added native support for categorical features to HGBT. To use it, you have to ordinal encode your categoricals, e.g. in a ColumnTransformer (potentially part of a pipeline), and indicate the column positions of the passed X via the parameter categorical_features.

How can we then programmatically, i.e. without manually filling in categorical_features, specify the positions of the categorical (ordinal encoded) columns in the feature matrix X that is finally passed to HGBT?
```python
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

X, y = ...
ct = make_column_transformer(
    (OrdinalEncoder(), make_column_selector(dtype_include="category")),
    remainder="passthrough",
)
hist_native = make_pipeline(
    ct,
    HistGradientBoostingRegressor(categorical_features=???),
)
```

How to fill `???`?
Possible solutions
- Set it manually, e.g. use OrdinalEncoder as first or last part of a ColumnTransformer. This is currently used in this example, but it's not ideal.
- Pass a callable/function, e.g. HistGradientBoostingRegressor(categorical_features=my_function), see https://github.com/scikit-learn/scikit-learn/pull/18394#issuecomment-731568451 for details. Sadly, this doesn't work: it breaks when the pipeline is used in e.g. cross_val_score, because the estimators are cloned there, and thus the callable refers to an unfitted ColumnTransformer.
- Pass feature names once they are available. Even then, you have to know the exact feature names that are created by OrdinalEncoder.
- Pass feature-aligned metadata ("this is a categorical feature"), similar to SLEP006 and proposed in #4196.
- Internally use an OrdinalEncoder within the GBDT estimator so that users don't need to create a pipeline.
Further context
One day, this might become relevant for more estimators; for linear models, see #18893.
Issue Analytics
- Created: 3 years ago
- Reactions: 3
- Comments: 14 (14 by maintainers)
(I edited the post to add option 5)
Ideally we would have option 4 eventually, although I don't see it happening too soon.
Option 5 arguably makes life easiest for users, as they don't even need to use a pipeline anymore. It also adds complexity to the estimator, because we basically need to mimic the pipeline logic (e.g. call self._encoder.transform() in predict(), etc.). Despite the potential complexity, I feel it might be worth giving it a try at least as a prototype: it's probably the most contained solution, so a PR is more likely to be acceptable in the near future than the other ones.

Throwing another idea out there: