New strategies for KBinsDiscretizer
New non-parametric strategies could be added to KBinsDiscretizer, such as geometric, winsorized, and combined strategies (Uniform+Quantile, Geometric+Uniform, Geometric+Quantile).

Winsorized binning spaces the edges over an interpercentile range (p95 - p05 in my examples) instead of the peak-to-peak range (max - min) used by Uniform. This lets the algorithm ignore outliers while preserving the shape of the distribution for the bulk of the data. Geometric binning uses bin widths that follow a geometric progression, which can "deskew" a skewed distribution (or skew a symmetric one); it is used in some GIS packages and could be beneficial for regression models. Uniquant is Uniform+Quantile (CatBoost also has this one): each edge is simply the average of the corresponding Uniform and Quantile edges. The same goes for Geouni and Geoquant. I did not plot the Quantile strategy, since it always produces a uniform distribution. I used `n_bins=31` with `N=10_000`.
More info about winsorized binning can be found here. It would require an additional parameter, and its effect can largely be achieved by the other strategies without extra hyperparameters, so this one is debatable. More info about geometric binning (and other interesting techniques) can be found here.
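To make these definitions concrete, here is a minimal numpy sketch of the three edge computations (the function names are mine, and geometric binning is assumed to receive strictly positive data):

```python
import numpy as np

def winsorized_edges(x, n_bins, p_lo=5, p_hi=95):
    # Evenly spaced edges over the p05..p95 range instead of min..max,
    # so extreme outliers no longer stretch the bins.
    lo, hi = np.percentile(x, [p_lo, p_hi])
    return np.linspace(lo, hi, n_bins + 1)

def geometric_edges(x, n_bins):
    # Edge positions follow a geometric progression between min and max,
    # compressing the dense low end of a right-skewed feature.
    # Assumes strictly positive values; otherwise the data needs shifting.
    return np.geomspace(x.min(), x.max(), n_bins + 1)

def uniquant_edges(x, n_bins):
    # Uniquant: element-wise average of the Uniform and Quantile edges.
    uniform = np.linspace(x.min(), x.max(), n_bins + 1)
    quantile = np.percentile(x, np.linspace(0, 100, n_bins + 1))
    return (uniform + quantile) / 2
```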
Results for LogisticRegression, RandomForestClassifier, Ridge, and RandomForestRegressor. Datasets were generated with `make_classification` and `make_regression` with `n_samples=1000` and `n_bins=10` (the default value). Combined here is Uniform+Quantile; I did not test the other combinations mentioned above.
All these new strategies are easy to implement (each is just another elif branch and about two lines of code) and fast to compute, since no fitting is necessary.
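Schematically, the dispatch could look like the sketch below. It reuses the helper functions from the earlier sketch; the new strategy names are illustrative, not scikit-learn's actual internals.

```python
import numpy as np

def _compute_edges(col, n_bins, strategy):
    # Each new strategy is just one more elif branch.
    if strategy == "uniform":
        return np.linspace(col.min(), col.max(), n_bins + 1)
    elif strategy == "quantile":
        return np.percentile(col, np.linspace(0, 100, n_bins + 1))
    elif strategy == "winsorized":            # hypothetical name
        return winsorized_edges(col, n_bins)  # helpers from the sketch above
    elif strategy == "geometric":             # hypothetical name
        return geometric_edges(col, n_bins)
    elif strategy == "uniquant":              # hypothetical name
        return uniquant_edges(col, n_bins)
    raise ValueError(f"Unknown strategy: {strategy!r}")
```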
Top GitHub Comments
Instead of programming new strategies, could we allow passing an arbitrary strategy as a callable?
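For illustration, a user-side sketch of that idea, assuming a hypothetical version of the strategy parameter that accepts a callable mapping a feature column and bin count to bin edges (it only accepts strings today):

```python
import numpy as np

def winsorized(col, n_bins):
    # User-supplied rule: one feature column and a bin count in,
    # monotonically increasing bin edges out.
    lo, hi = np.percentile(col, [5, 95])
    return np.linspace(lo, hi, n_bins + 1)

# Hypothetical usage; the real `strategy` parameter only accepts
# 'uniform', 'quantile', or 'kmeans':
# from sklearn.preprocessing import KBinsDiscretizer
# est = KBinsDiscretizer(n_bins=10, strategy=winsorized)
```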
I’m curious to know more about concrete/real-life use cases with(in) scikit-learn in order to evaluate the possible added value.