Benchmark results with better parameters
Used a laptop for a better demo benchmark:
- Intel Core i7-7700HQ (4 cores, 8 threads), unthrottled
- 32GB RAM DDR4 2400 MHz (dual channel)
- Python 3.6, scikit-learn 0.20, numba 0.40.1
Setup for proper benchmarking:
- No LightGBM / pygbm warmup allowed
- 1 million training samples (10 million might crash even with 64GB RAM; pygbm requires at least 24GB RAM for 1 million)
- 500 training iterations
- 255 leaves
- 0.05 learning rate (could be changed to 0.10 for better comparison with independent benchmarks)
The benchmark in the master branch (https://github.com/ogrisel/pygbm/blob/master/benchmarks/bench_higgs_boson.py) is way too short and doesn’t really test the speed of the whole model because it finishes so quickly: returns diminish as the number of iterations increases, and that is the part that is hard to optimize once tree construction itself is already optimized.
Results:
| Model | Time | AUC | Comments |
|---|---|---|---|
| LightGBM | 45.260s | 0.8293 | Reference, runnable with 8GB RAM. |
| pygbm | 359.101s | 0.8180 | Requires over 24GB RAM. Slows down as more trees are added. |
Conclusion:
- pygbm is 5 to 10 times slower, but slower does not mean worse: it is actually very fast compared to xgboost with the exact method two years ago, and as of today it can be considered competitive in speed with xgboost exact, provided you have enough RAM
- pygbm requires far too much RAM; you only notice it with many iterations because memory usage seems to grow linearly
To run the benchmark from a clean setup (not optimized for fastest performance, but with the prerequisites in place: scikit-learn 0.20, numba 0.39), one can use the following:
```
pip install lightgbm
pip install -U scikit-learn
pip install -U numba
git clone https://github.com/ogrisel/pygbm.git
cd pygbm
```
Before installing pygbm, change the following at line 147 of pygbm/grower.py (https://github.com/ogrisel/pygbm/blob/master/pygbm/grower.py#L146-L147):

```python
node.construction_speed = (node.sample_indices.shape[0] /
                           node.find_split_time)
```

to:

```python
node.construction_speed = (node.sample_indices.shape[0] / 1.0)
```

This avoids the infamous division-by-zero error.
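If you prefer to script that edit instead of opening an editor, here is a rough sketch of a hypothetical patch helper (it assumes the expression above appears verbatim, and only once, in pygbm/grower.py):

```python
# Hypothetical patch helper: rewrite the construction_speed expression so it
# no longer divides by node.find_split_time (which can be zero).
import re

path = "pygbm/grower.py"
with open(path) as f:
    source = f.read()

patched = re.sub(
    r"node\.construction_speed = \(node\.sample_indices\.shape\[0\] /\s*"
    r"node\.find_split_time\)",
    "node.construction_speed = (node.sample_indices.shape[0] / 1.0)",
    source,
)

with open(path, "w") as f:
    f.write(patched)
```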
Then, one can run the following:

```
pip install --editable .
```
If your Internet connection is slow, download the HIGGS dataset ahead of time from https://archive.ics.uci.edu/ml/machine-learning-databases/00280/ and then uncompress it.
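If you would rather script the download, a minimal sketch (assuming the usual `HIGGS.csv.gz` file name on the UCI server; the compressed file is several GB) could be:

```python
# Download and decompress HIGGS.csv.gz from the UCI repository.
import gzip
import shutil
import urllib.request

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "00280/HIGGS.csv.gz")
urllib.request.urlretrieve(url, "HIGGS.csv.gz")
with gzip.open("HIGGS.csv.gz", "rb") as src, open("HIGGS.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)
```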
Then, you may run a proper benchmark using the following (make sure to change `load_path` to point to your HIGGS CSV file):
```python
import os
from time import time
import gc

import numpy as np
import pandas as pd
import numba
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from pygbm import GradientBoostingMachine
from lightgbm import LGBMRegressor

n_leaf_nodes = 255
n_trees = 500
lr = 0.05
max_bins = 255
load_path = "mnt/HIGGS/HIGGS.csv"
subsample = 1000000  # Change this to 10000000 if you wish, or to None

# Load the HIGGS dataset: label in the first column, features in the rest
df = pd.read_csv(load_path, header=None, dtype=np.float32)
target = df.values[:, 0]
data = np.ascontiguousarray(df.values[:, 1:])
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=50000, random_state=0)

if subsample is not None:
    data_train, target_train = data_train[:subsample], target_train[:subsample]

n_samples, n_features = data_train.shape
print(f"Training set with {n_samples} records with {n_features} features.")

# Includes warmup time penalty
print("Fitting a LightGBM model...")
tic = time()
lightgbm_model = LGBMRegressor(n_estimators=n_trees, num_leaves=n_leaf_nodes,
                               learning_rate=lr, silent=False)
lightgbm_model.fit(data_train, target_train)
toc = time()
predicted_test = lightgbm_model.predict(data_test)
roc_auc = roc_auc_score(target_test, predicted_test)
print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}")

# Free memory before benchmarking pygbm
del lightgbm_model
del predicted_test
gc.collect()

# Includes warmup time penalty
print("Fitting a pygbm model...")
tic = time()
pygbm_model = GradientBoostingMachine(learning_rate=lr, max_iter=n_trees,
                                      max_bins=max_bins,
                                      max_leaf_nodes=n_leaf_nodes,
                                      random_state=0, scoring=None,
                                      verbose=1, validation_split=None)
pygbm_model.fit(data_train, target_train)
toc = time()
predicted_test = pygbm_model.predict(data_test)
roc_auc = roc_auc_score(target_test, predicted_test)
print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}")

del pygbm_model
del predicted_test
gc.collect()

if hasattr(numba, 'threading_layer'):
    print("Threading layer chosen: %s" % numba.threading_layer())
```
If something is missing in the script, please let me know.
@dhirschfeld there are no nested `prange` loops in pygbm so far and we don’t do any linear algebra; numpy is just used as a passive data structure (no BLAS routines used), so composability is probably useless in this context.

@NicolasHug LightGBM doesn’t have a parameter named `min_sample_leaf`. Refer to https://github.com/Microsoft/LightGBM/blob/dfe0fae4ea5a412d253c25fbd997224e9243bd9a/docs/Parameters.rst#min_data_in_leaf
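If it helps, in LightGBM's scikit-learn style wrapper that setting is exposed as `min_child_samples` (the native LightGBM name is `min_data_in_leaf`); a minimal sketch, assuming the standard LGBMRegressor API:

```python
# Minimum number of samples per leaf in LightGBM's scikit-learn wrapper is
# controlled by min_child_samples (native LightGBM name: min_data_in_leaf).
from lightgbm import LGBMRegressor

model = LGBMRegressor(n_estimators=500, num_leaves=255,
                      learning_rate=0.05, min_child_samples=20)
```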