
Problems with the parameter `learning_rate` in HistGradientBoostingClassifier


Describe the bug

Setting the argument `learning_rate` to a value larger than 0.1 in HistGradientBoostingClassifier leads to a large performance degradation.

Steps/Code to Reproduce


import time
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_svmlight_file

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: required before importing HistGradientBoostingClassifier in 0.22
from sklearn.ensemble import HistGradientBoostingClassifier

if __name__ == '__main__':
    
    n_estimators = 100
    learning_rate = 0.1
    
    seed = 0
    n_jobs = 6
    
    train = load_svmlight_file('../../../Dataset/libsvm/letter_training')
    test = load_svmlight_file('../../../Dataset/libsvm/letter_testing')
    
    X_train, y_train = np.asanyarray(train[0].toarray(), order='F'), train[1]-1
    X_test, y_test = np.asanyarray(test[0].toarray(), order='C'), test[1]-1

    """ XGBoost (Ver==1.1.1) """
    model = XGBClassifier(n_estimators=n_estimators,
                          learning_rate=learning_rate,
                          objective='multi:softmax',
                          random_state=seed,
                          n_jobs=n_jobs)
    
    tic = time.time()
    model.fit(X_train, y_train)
    toc = time.time()
    training_time = toc - tic
    
    tic = time.time()
    y_pred = model.predict(X_test)
    toc = time.time()
    evaluating_time = toc - tic
    
    acc = accuracy_score(y_test, y_pred)
    
    print('XGBoost Testing Acc: {:.4f}%'.format(100.*acc))
    print('XGBoost Training Time: {:.4f} s'.format(training_time))
    print('XGBoost Evaluating Time: {:.4f} s\n'.format(evaluating_time))
    
    """ LightGBM (Ver==2.3.1) """
    model = LGBMClassifier(n_estimators=n_estimators,
                           learning_rate=learning_rate,
                           objective='multiclass',
                           random_state=seed,
                           n_jobs=n_jobs)
    
    tic = time.time()
    model.fit(X_train, y_train)
    toc = time.time()
    training_time = toc - tic
    
    tic = time.time()
    y_pred = model.predict(X_test)
    toc = time.time()
    evaluating_time = toc - tic
    
    acc = accuracy_score(y_test, y_pred)
    
    print('LightGBM Testing Acc: {:.4f}%'.format(100.*acc))
    print('LightGBM Training Time: {:.4f} s'.format(training_time))
    print('LightGBM Evaluating Time: {:.4f} s\n'.format(evaluating_time))

    """ Sklearn-GBDT (Ver==0.22.1) """
    model = HistGradientBoostingClassifier(max_iter=n_estimators,
                                           learning_rate=learning_rate,
                                           validation_fraction=None,
                                           random_state=seed)
    
    tic = time.time()
    model.fit(X_train, y_train)
    toc = time.time()
    training_time = toc - tic
    
    tic = time.time()
    y_pred = model.predict(X_test)
    toc = time.time()
    evaluating_time = toc - tic
    
    acc = accuracy_score(y_test, y_pred)
    
    print('Sklearn Testing Acc: {:.4f}%'.format(100.*acc))
    print('Sklearn Training Time: {:.4f} s'.format(training_time))
    print('Sklearn Evaluating Time: {:.4f} s'.format(evaluating_time))
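
The script above depends on local copies of the LIBSVM letter files. As a convenience, here is a minimal self-contained sketch that sweeps learning_rate on a synthetic multiclass dataset instead; the dataset shape and parameter values are illustrative assumptions, not part of the original report, but the same cliff above learning_rate=0.1 should be visible if the bug reproduces.

# Self-contained learning_rate sweep (synthetic data; illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=16, n_informative=10,
                           n_classes=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for lr in (0.001, 0.01, 0.1, 0.3, 0.5):
    clf = HistGradientBoostingClassifier(max_iter=100, learning_rate=lr,
                                         random_state=0)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print('learning_rate={:>5}  Testing Acc: {:.4f}%'.format(lr, 100. * acc))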

Expected Results

I expect the performance of HistGradientBoostingClassifier with `learning_rate=0.3` to differ only slightly from the case with `learning_rate=0.1`, either better or worse, rather than degrading massively.

Actual Results

On the letter dataset, publicly available in the LIBSVM dataset collection, HistGradientBoostingClassifier achieves a testing accuracy of 95.74% with `learning_rate=0.1`, yet the accuracy drops to 6.16% and 6.06% with `learning_rate=0.3` and `0.5`, respectively. Similar behavior occurs on other datasets such as USPS.

Versions

sklearn: 0.22.1
numpy: 1.18.1
scipy: 1.4.1
Cython: 0.29.15
pandas: 1.0.1
matplotlib: 3.1.3
joblib: 0.14.1

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

xuyxu commented, Jun 29, 2020

I also observe a huge performance degradation with LightGBM and XGBoost after using `get_equivalent_model` to pass parameters. If this is the expected behavior, this issue can be closed 😃. Thanks.
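
(`get_equivalent_model` is not part of scikit-learn's public API. The snippet below is a hand-rolled sketch of the kind of parameter mapping such a helper performs; the correspondences are assumptions on my part, not something stated in this issue, and the two libraries still differ in defaults, such as binning and minimum hessian per leaf, that this mapping does not equalize.)

# Hand-rolled sketch of mapping HistGradientBoostingClassifier settings
# onto LightGBM (assumed correspondences; approximate at best).
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from lightgbm import LGBMClassifier

sk_params = dict(max_iter=100, learning_rate=0.1, max_leaf_nodes=31,
                 min_samples_leaf=20, l2_regularization=0.0, random_state=0)
sk_model = HistGradientBoostingClassifier(**sk_params)

lgbm_model = LGBMClassifier(
    n_estimators=sk_params['max_iter'],              # boosting iterations
    learning_rate=sk_params['learning_rate'],
    num_leaves=sk_params['max_leaf_nodes'],          # leaves per tree
    min_child_samples=sk_params['min_samples_leaf'],
    reg_lambda=sk_params['l2_regularization'],       # L2 on leaf values
    random_state=0,
)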

NicolasHug commented, Jun 29, 2020

I cannot reproduce your results @AaronX121: much like scikit-learn, LightGBM gets very degraded performance when using a learning rate higher than 0.1 (I haven't tried XGBoost), and also when setting comparable hyperparameters with `get_equivalent_model`.

I suspect that the discrepancy you see comes from different ways of handling early stopping, though I haven't looked in detail.

Also, note that 0.1 seems like the upper limit for the LR: setting it to 0.001 gets you decent results.
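
Since early stopping is the suspected source of the discrepancy, here is a minimal sketch of pinning it off in all three libraries so that learning_rate is the only moving part. Parameter names assume the versions reported above (scikit-learn 0.22, LightGBM 2.3, XGBoost 1.1); newer releases expose different switches (e.g. scikit-learn 0.23 added an explicit early_stopping parameter).

# Keep early stopping off everywhere so learning_rate is the only moving part.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

lr = 0.3

# scikit-learn 0.22: early stopping is governed by n_iter_no_change;
# leaving it at None (the default) disables it.
sk_model = HistGradientBoostingClassifier(max_iter=100, learning_rate=lr,
                                          n_iter_no_change=None,
                                          random_state=0)

# LightGBM / XGBoost: early stopping only activates when
# early_stopping_rounds (plus an eval set) is passed to fit(),
# so omitting it keeps both libraries running all n_estimators rounds.
lgbm_model = LGBMClassifier(n_estimators=100, learning_rate=lr, random_state=0)
xgb_model = XGBClassifier(n_estimators=100, learning_rate=lr, random_state=0)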
