Enhance binning strategy

Results are comparable to LightGBM when n_samples <= n_bins, because both libraries then use the actual feature values as bin thresholds.

This is no longer the case when n_samples > n_bins. In particular, on this very easy dataset (target = X[:, 0] > 0), LightGBM finds a perfect threshold of 1e-35 while pygbm's threshold is -0.262. This leads to different trees and less accurate predictions for pygbm (accuracy of 1 for LightGBM vs. 0.9 for pygbm).

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from pygbm import GradientBoostingMachine
from pygbm.plotting import plot_tree

rng = np.random.RandomState(seed=2)

n_leaf_nodes = 5
n_trees = 1
lr = 1.
min_samples_leaf = 1

max_bins = 5
n_samples = 100

X = rng.normal(size=(n_samples, 5))
y = (X[:, 0] > 0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

pygbm_model = GradientBoostingMachine(
    loss='log_loss', learning_rate=lr, max_iter=n_trees, max_bins=max_bins,
    max_leaf_nodes=n_leaf_nodes, random_state=0, scoring=None, verbose=1,
    validation_split=None, min_samples_leaf=min_samples_leaf)
pygbm_model.fit(X_train, y_train)
predicted_test = pygbm_model.predict(X_test)
acc = accuracy_score(y_test, predicted_test)
print(acc)

lightgbm_model = LGBMClassifier(
    objective='binary', n_estimators=n_trees, max_bin=max_bins,
    num_leaves=n_leaf_nodes, learning_rate=lr, verbose=10, random_state=0,
    boost_from_average=False, min_data_in_leaf=min_samples_leaf)
lightgbm_model.fit(X_train, y_train)
predicted_test = lightgbm_model.predict(X_test)
acc = accuracy_score(y_test, predicted_test)
print(acc)

plot_tree(pygbm_model, lightgbm_model, view=True)
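
The effect described above can be illustrated without either library. The following is only a minimal sketch, assuming that candidate thresholds are taken at equally spaced percentiles when there are more samples than bins. It is not the actual binning code of pygbm or LightGBM, and the helper names `quantile_thresholds` and `best_split_accuracy` are hypothetical. It compares the best single split reachable from 4 percentile thresholds (max_bins = 5) with the best split reachable when every midpoint between consecutive distinct values is a candidate.

```python
import numpy as np


def quantile_thresholds(values, max_bins):
    """Candidate thresholds at equally spaced percentiles (hypothetical helper)."""
    percentiles = np.linspace(0, 100, num=max_bins + 1)[1:-1]
    return np.percentile(values, percentiles)


def best_split_accuracy(values, target, thresholds):
    """Best accuracy of a single split `values > t` over the candidate thresholds."""
    return max(np.mean((values > t) == target) for t in thresholds)


rng = np.random.RandomState(2)
x = rng.normal(size=100)
y = x > 0  # same kind of target as in the script above

# Coarse binning: only max_bins - 1 = 4 candidate thresholds.
coarse = quantile_thresholds(x, max_bins=5)

# Fine binning: one candidate between every pair of consecutive distinct values.
distinct = np.unique(x)
fine = (distinct[:-1] + distinct[1:]) / 2

print("coarse thresholds:", coarse)
print("best accuracy, coarse:", best_split_accuracy(x, y, coarse))
print("best accuracy, fine:", best_split_accuracy(x, y, fine))
```

With the fine candidates, one midpoint always lies between the largest negative and the smallest positive value, so a perfect split exists. With only 4 percentile thresholds, a perfect split is possible only if one of them happens to fall in that gap.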

Issue Analytics

State:
Created 5 years ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

ogrisel commented, Nov 8, 2018

It’s not just the split gain that is different on the left root child: it’s also not splitting on the same feature.

ogrisel commented, Nov 20, 2018

For my second comment (#39 (comment)), the discrepancy comes from LightGBM's min_data_in_bin parameter, which is 3 by default. Setting it to 1 gives the same trees. I should have seen this sooner 😒

Nice catch.
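
For reference, the change described in the Nov 20 comment could look like the sketch below. It reuses the LightGBM settings from the reproduction script and adds min_data_in_bin=1. The `lightgbm_params` name is only illustrative, and passing min_data_in_bin as a keyword argument to LGBMClassifier is an assumption here (the scikit-learn wrapper forwards extra keyword arguments to LightGBM).

```python
# Settings taken from the reproduction script, with min_data_in_bin lowered
# from LightGBM's default of 3 to 1, as suggested in the comment above.
lightgbm_params = dict(
    objective='binary',
    n_estimators=1,
    max_bin=5,
    num_leaves=5,
    learning_rate=1.0,
    random_state=0,
    boost_from_average=False,
    min_data_in_leaf=1,
    min_data_in_bin=1,  # allow bins that contain a single sample
)

# Usage (requires lightgbm and the X_train / y_train from the script):
#   from lightgbm import LGBMClassifier
#   lightgbm_model = LGBMClassifier(**lightgbm_params).fit(X_train, y_train)
print(lightgbm_params)
```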