New strategies for KBinsDiscretizer

New non-parametric strategies could be added to KBinsDiscretizer (KBD), such as geometric, winsorized, and combined ones (Uniform+Quantile, Geometric+Uniform, Geometric+Quantile).

Winsorized binning uses an interpercentile range (in my examples, p95 - p05) instead of the peak-to-peak range (max - min) used by Uniform. This lets the algorithm ignore outliers and preserve the shape of the distribution for most of the data mass.

Geometric binning uses incremental binning. This technique can “deskew” a distribution (or skew it if it was symmetric). It is used in some GIS packages and could be beneficial for regression models.

Uniquant is Uniform+Quantile (CatBoost also has this one). It is just the average of the Uniform and Quantile bins. The same goes for Geouni and Geoquant.

I did not plot the Quantile strategy, since it always yields a uniform distribution. I used n_bins=31 with N=10_000.
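The strategies described above can be sketched as plain NumPy functions that return bin edges. This is only an illustration under stated assumptions, not the code used for the plots: the function names are hypothetical, and the geometric variant assumes bin widths that grow by a constant `ratio`, which is one possible reading of "incremental binning" and adds a parameter the issue does not mention.

```python
import numpy as np


def uniform_edges(x, n_bins):
    # Equal-width bins over the peak-to-peak range (max - min).
    return np.linspace(x.min(), x.max(), n_bins + 1)


def quantile_edges(x, n_bins):
    # Bins holding roughly the same number of samples.
    return np.percentile(x, np.linspace(0, 100, n_bins + 1))


def winsorized_edges(x, n_bins, lower=5, upper=95):
    # Equal-width bins over the interpercentile range (p95 - p05 here).
    lo, hi = np.percentile(x, [lower, upper])
    return np.linspace(lo, hi, n_bins + 1)


def geometric_edges(x, n_bins, ratio=1.2):
    # Assumption: each bin is `ratio` times wider than the previous one.
    # `ratio` must differ from 1, otherwise the division below is 0 / 0.
    steps = ratio ** np.arange(n_bins + 1)
    frac = (steps - 1) / (steps[-1] - 1)  # runs from 0 to 1
    return x.min() + frac * (x.max() - x.min())


def uniquant_edges(x, n_bins):
    # Element-wise average of the Uniform and Quantile edges.
    return (uniform_edges(x, n_bins) + quantile_edges(x, n_bins)) / 2


rng = np.random.default_rng(0)
x = rng.lognormal(size=10_000)
edges = winsorized_edges(x, n_bins=31)
# Using only the inner edges sends values beyond p05/p95 to the outer bins.
codes = np.digitize(x, edges[1:-1])
```

Because only the inner edges are passed to `np.digitize`, outliers are absorbed by the first and last bins instead of stretching every bin.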

[Figures: Quant_strats_Bimodal_10000, Quant_strats_Exponential_10000, Quant_strats_LogNormal_10000, Quant_strats_Normal_10000]

More information about winsorized binning can be found here. It would require an additional parameter, and its effect could be achieved by other strategies without additional hyperparameters, so this one is debatable. More information about geometric binning (and other interesting techniques) can be found here.

Results for LogisticRegression, RandomForestClassifier, Ridge, and RandomForestRegressor. The datasets were generated with make_classification and make_regression, with n_samples=1000 and n_bins=10 (default value). “Combined” here means Uniform+Quantile. I did not test the other combinations mentioned above.
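A comparison of this kind could be set up roughly as follows. This is a minimal sketch covering only the strategies scikit-learn already ships and only the LogisticRegression case; the cross-validation setup, `random_state`, and `max_iter` are assumptions, since the issue does not describe the exact protocol.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data, as in the issue (n_samples=1000).
X, y = make_classification(n_samples=1000, random_state=0)

scores = {}
for strategy in ("uniform", "quantile", "kmeans"):
    model = make_pipeline(
        KBinsDiscretizer(n_bins=10, strategy=strategy),
        LogisticRegression(max_iter=1000),
    )
    # Mean ROC AUC over 3 folds (the fold count is an assumption).
    scores[strategy] = cross_val_score(
        model, X, y, scoring="roc_auc", cv=3
    ).mean()

for strategy, score in scores.items():
    print(f"{strategy}: {score:.3f}")
```

Putting the discretizer inside the pipeline means the bin edges are recomputed on each training fold, so the scores are not inflated by edges learned from held-out data.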

[Figures: R2_regression_forest, R2_regression_linear, ROC_AUC_classification_forest, ROC_AUC_classification_linear]

All of these new algorithms are easy to implement (just another elif and two lines of code for each one) and fast to compute, since no fitting is necessary.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
glemaitre commented, Jan 24, 2021

Instead of programming new strategies, could we allow passing an arbitrary strategy as a callable?
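To make the callable idea concrete, here is a minimal sketch of what such an interface might look like. KBinsDiscretizer does not accept a callable strategy; `CallableBinner`, `edges_fn`, and `interpercentile_edges` are hypothetical names, and the class only imitates the fit/transform convention with NumPy.

```python
import numpy as np


class CallableBinner:
    """Hypothetical stand-in, not part of scikit-learn."""

    def __init__(self, n_bins, edges_fn):
        self.n_bins = n_bins
        # edges_fn(column, n_bins) must return n_bins + 1 sorted edges.
        self.edges_fn = edges_fn

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # One array of edges per feature, computed by the user's callable.
        self.bin_edges_ = [
            np.asarray(self.edges_fn(X[:, j], self.n_bins), dtype=float)
            for j in range(X.shape[1])
        ]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        out = np.empty(X.shape, dtype=int)
        for j, edges in enumerate(self.bin_edges_):
            # Inner edges only, so out-of-range values fall in the end bins.
            out[:, j] = np.digitize(X[:, j], edges[1:-1])
        return out


def interpercentile_edges(column, n_bins):
    # Example user-supplied strategy: equal widths between p05 and p95.
    lo, hi = np.percentile(column, [5, 95])
    return np.linspace(lo, hi, n_bins + 1)


rng = np.random.default_rng(0)
X = rng.exponential(size=(1000, 2))
binner = CallableBinner(n_bins=10, edges_fn=interpercentile_edges).fit(X)
codes = binner.transform(X)
```

With an interface like this, each strategy proposed in the issue becomes a small user-defined function rather than a new branch inside the estimator.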

1 reaction
lorentzenchr commented, Jan 23, 2021

I’m curious to know more about concrete/real-life use cases with(in) scikit-learn in order to evaluate the possible added value.

