New strategies for KBinsDiscretizer
New non-parametric strategies could be added to KBinsDiscretizer, such as geometric, winsorized, and combined strategies (Uniform+Quantile, Geometric+Uniform, Geometric+Quantile).

Winsorized binning spaces the edges over an interpercentile range (p95 - p05 in my examples) instead of the peak-to-peak range (max - min) used by Uniform. This lets the algorithm ignore outliers while preserving the shape of the distribution for the bulk of the data. Geometric binning uses bin widths that follow a geometric progression, which can "deskew" a skewed distribution (or skew a symmetric one); it is used in some GIS packages and could be beneficial for regression models. Uniquant is Uniform+Quantile (CatBoost also has this one): each edge is simply the average of the corresponding Uniform and Quantile edges. The same goes for Geouni and Geoquant. I did not plot the Quantile strategy, since it always produces a uniform distribution. I used `n_bins=31` with `N=10_000`.
More info about winsorized binning can be found here. It would require an additional parameter, and its effect can largely be achieved by the other strategies without extra hyperparameters, so this one is debatable. More info about geometric binning (and other interesting techniques) can be found here.
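To make these definitions concrete, here is a minimal numpy sketch of the three edge computations (the function names are mine, and geometric binning is assumed to receive strictly positive data):

```python
import numpy as np

def winsorized_edges(x, n_bins, p_lo=5, p_hi=95):
    # Evenly spaced edges over the p05..p95 range instead of min..max,
    # so extreme outliers no longer stretch the bins.
    lo, hi = np.percentile(x, [p_lo, p_hi])
    return np.linspace(lo, hi, n_bins + 1)

def geometric_edges(x, n_bins):
    # Edge positions follow a geometric progression between min and max,
    # compressing the dense low end of a right-skewed feature.
    # Assumes strictly positive values; otherwise the data needs shifting.
    return np.geomspace(x.min(), x.max(), n_bins + 1)

def uniquant_edges(x, n_bins):
    # Uniquant: element-wise average of the Uniform and Quantile edges.
    uniform = np.linspace(x.min(), x.max(), n_bins + 1)
    quantile = np.percentile(x, np.linspace(0, 100, n_bins + 1))
    return (uniform + quantile) / 2
```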
Results for LogisticRegression, RandomForestClassifier, Ridge, and RandomForestRegressor. Datasets were generated with `make_classification` and `make_regression` with `n_samples=1000` and `n_bins=10` (the default value). Combined here is Uniform+Quantile; I did not test the other combinations mentioned above.
All these new strategies are easy to implement (each is just another elif branch and about two lines of code) and fast to compute, since no fitting is necessary.
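Schematically, the dispatch could look like the sketch below. It reuses the helper functions from the earlier sketch; the new strategy names are illustrative, not scikit-learn's actual internals.

```python
import numpy as np

def _compute_edges(col, n_bins, strategy):
    # Each new strategy is just one more elif branch.
    if strategy == "uniform":
        return np.linspace(col.min(), col.max(), n_bins + 1)
    elif strategy == "quantile":
        return np.percentile(col, np.linspace(0, 100, n_bins + 1))
    elif strategy == "winsorized":            # hypothetical name
        return winsorized_edges(col, n_bins)  # helpers from the sketch above
    elif strategy == "geometric":             # hypothetical name
        return geometric_edges(col, n_bins)
    elif strategy == "uniquant":              # hypothetical name
        return uniquant_edges(col, n_bins)
    raise ValueError(f"Unknown strategy: {strategy!r}")
```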
Top GitHub Comments
Instead of programming new strategies, could we allow passing an arbitrary strategy as a callable?
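For illustration, a user-side sketch of that idea, assuming a hypothetical version of the strategy parameter that accepts a callable mapping a feature column and bin count to bin edges (it only accepts strings today):

```python
import numpy as np

def winsorized(col, n_bins):
    # User-supplied rule: one feature column and a bin count in,
    # monotonically increasing bin edges out.
    lo, hi = np.percentile(col, [5, 95])
    return np.linspace(lo, hi, n_bins + 1)

# Hypothetical usage; the real `strategy` parameter only accepts
# 'uniform', 'quantile', or 'kmeans':
# from sklearn.preprocessing import KBinsDiscretizer
# est = KBinsDiscretizer(n_bins=10, strategy=winsorized)
```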
I’m curious to know more about concrete/real-life use cases with(in) scikit-learn in order to evaluate the possible added value.