Enhance binning strategy
Results are comparable to LightGBM when n_samples <= n_bins because both libraries use the actual feature values as bin thresholds.
This is no longer the case when n_samples > n_bins. In particular, on this very easy dataset (target = X[:, 0] > 0), LightGBM finds a perfect threshold of 1e-35 while pygbm's threshold is -0.262. This leads to different trees and less accurate predictions (accuracy of 1 vs. 0.9).
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
import numpy as np
from pygbm import GradientBoostingMachine
from lightgbm import LGBMClassifier
from pygbm.plotting import plot_tree
rng = np.random.RandomState(seed=2)
n_leaf_nodes = 5
n_trees = 1
lr = 1.
min_samples_leaf = 1
max_bins = 5
n_samples = 100
X = rng.normal(size=(n_samples, 5))
y = (X[:, 0] > 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
pygbm_model = GradientBoostingMachine(
    loss='log_loss', learning_rate=lr, max_iter=n_trees, max_bins=max_bins,
    max_leaf_nodes=n_leaf_nodes, random_state=0, scoring=None, verbose=1,
    validation_split=None, min_samples_leaf=min_samples_leaf)
pygbm_model.fit(X_train, y_train)
predicted_test = pygbm_model.predict(X_test)
acc = accuracy_score(y_test, predicted_test)
print(acc)
lightgbm_model = LGBMClassifier(
    objective='binary', n_estimators=n_trees, max_bin=max_bins,
    num_leaves=n_leaf_nodes, learning_rate=lr, verbose=10, random_state=0,
    boost_from_average=False, min_data_in_leaf=min_samples_leaf)
lightgbm_model.fit(X_train, y_train)
predicted_test = lightgbm_model.predict(X_test)
acc = accuracy_score(y_test, predicted_test)
print(acc)
plot_tree(pygbm_model, lightgbm_model, view=True)
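
The effect of coarse binning on the best reachable split can be sketched with NumPy alone. This is not pygbm's or LightGBM's actual binning code: `quantile_thresholds` and `best_split_error` are hypothetical helpers, and equal-frequency (percentile) bin edges are an assumption made for illustration. The sketch compares the lowest single-split error reachable with only `max_bins - 1` candidate thresholds against the error reachable when every midpoint between consecutive distinct values is a candidate.

```python
import numpy as np


def quantile_thresholds(values, max_bins):
    """Hypothetical helper: equal-frequency bin edges.

    Returns the max_bins - 1 interior percentiles of `values`.
    This is an illustrative assumption, not pygbm's implementation.
    """
    percentiles = np.linspace(0, 100, num=max_bins + 1)[1:-1]
    return np.percentile(values, percentiles)


def best_split_error(values, target, thresholds):
    """Lowest misclassification rate of a single split `values > t`."""
    return min(np.mean((values > t) != target) for t in thresholds)


rng = np.random.RandomState(2)
x = rng.normal(size=100)  # a single synthetic feature
y = x > 0                 # same kind of target as in the script above

# Coarse candidates: only max_bins - 1 thresholds are available.
coarse = quantile_thresholds(x, max_bins=5)

# Fine candidates: midpoints between consecutive distinct values.
distinct = np.unique(x)
fine = (distinct[:-1] + distinct[1:]) / 2

coarse_err = best_split_error(x, y, coarse)
fine_err = best_split_error(x, y, fine)
print("coarse thresholds:", coarse)
print("best error with coarse thresholds:", coarse_err)
print("best error with midpoint thresholds:", fine_err)
```

With midpoint candidates, one threshold always falls between the largest negative and the smallest positive value, so the split can separate the classes exactly. With a handful of percentile edges, that only happens if an edge happens to land in that gap.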

Issue Analytics
- State:
- Created 5 years ago
- Comments: 5 (3 by maintainers)

It’s not just the split gain that is different on the left root child: it’s also not splitting on the same feature.
Nice catch.