Programmatically pass categorical_features to HGBT

Describe the workflow you want to enable

#18394 added native support for categorical features to HGBT. To use it, you have to ordinal-encode your categoricals, e.g. in a ColumnTransformer (potentially part of a pipeline), and indicate their column positions in the passed X via the parameter categorical_features.

How can we then programmatically, i.e. without manually filling in categorical_features, specify the positions of categorical (ordinal encoded) columns in the feature matrix X that is finally passed to HGBT?

X, y = ...

ct = make_column_transformer(
    (OrdinalEncoder(),
     make_column_selector(dtype_include='category')),
    remainder='passthrough')

hist_native = make_pipeline(
    ct,
    HistGradientBoostingRegressor(categorical_features=???)
)

How should ??? be filled in?

Possible solutions

  1. Set it manually, e.g. use OrdinalEncoder as the first or last part of a ColumnTransformer. This is currently used in this example, but it’s not ideal.
  2. Pass a callable/function, e.g. HistGradientBoostingRegressor(categorical_features=my_function), see https://github.com/scikit-learn/scikit-learn/pull/18394#issuecomment-731568451 for details.

    Sadly, this doesn’t work. It breaks when the pipeline is used in e.g. cross_val_score, because the estimators are cloned there and the callable then refers to an unfitted CT.

  3. Pass feature names once they are available. Even then, you have to know the exact feature names that are created by OrdinalEncoder.
  4. Pass feature-aligned meta data “this is a categorical feature” similar to SLEP006 and proposed in #4196.
  5. Internally use an OE within the GBDT estimator so that users don’t need to create a pipeline

Further context

One day, this might become relevant for more estimators, for linear models see #18893.

Issue Analytics

  • State: open
  • Created 3 years ago
  • Reactions: 3
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

2 reactions
NicolasHug commented, Nov 22, 2020

(I edited the post to add option 5)

Ideally we would have 4 eventually, although I don’t see it happening too soon.

Option 5 arguably makes life easiest for users, as they don’t even need to use a pipeline anymore. It also adds complexity to the estimator, because we basically need to mimic the pipeline logic (e.g. call self._encoder.transform() in predict(), etc.). Despite the potential complexity, I feel it might be worth trying, at least as a prototype: it’s probably the most contained solution, so a PR is more likely to be accepted in the near future than one for the other options.

1 reaction
thomasjpfan commented, Nov 26, 2020

Throwing another idea out there:

def set_hist_categories(pipe):
    categorical_indices = pipe.output_indices_['ordinalencoder']  # see PR #18393
    pipe.set_params(hist__categorical_features=categorical_indices)

pipe = Pipeline([
    ('preprocess', ColumnTransformer(...)),
    ('hist', HistGradientBoostingRegressor(...))
], set_params=set_hist_categories)