
Question about xgboost CV and n_estimators / num_boost_round

See original GitHub issue

Hello

I’m enjoying exploring Optuna.

I have a question about getting cross-validated results with xgboost's xgb.cv and Optuna. I'm asking here as I couldn't find any examples of this in the repo.

I’ve studied the example you’ve posted here: https://github.com/optuna/optuna/blob/master/examples/xgboost_simple.py.

My attempt to use xgb.cv with Optuna:

import numpy as np
import optuna
import xgboost as xgb
from optuna.samplers import TPESampler


def objective(trial):

    dtrain = xgb.DMatrix(df[features], label=df.target,
                         feature_names=features)

    param = {
             'silent': 1,
             'objective': 'binary:logistic',
             'eval_metric': "auc",
             'booster': trial.suggest_categorical('booster', ['gbtree']),
             'alpha': trial.suggest_loguniform('alpha', 1e-3, 1.0)
         }

    if param['booster'] == 'gbtree':
        param['max_depth'] = trial.suggest_int('max_depth', 1, 9)
        param['scale_pos_weight'] = trial.suggest_int('scale_pos_weight', 3, 75)
        param['min_child_weight'] = trial.suggest_int('min_child_weight', 1, 9)
        param['eta'] = trial.suggest_loguniform('eta', 1e-3, 1.0)
        param['gamma'] = trial.suggest_loguniform('gamma', 1e-3, 1.0)
        param['subsample'] = trial.suggest_loguniform('subsample', 0.6, 1.0)
        param['colsample_bytree'] = trial.suggest_loguniform('colsample_bytree', 0.6, 1.0)
        param['grow_policy'] = trial.suggest_categorical('grow_policy', ['depthwise', 'lossguide'])
   
    xgb_cv_results = xgb.cv(params=param, dtrain=dtrain, num_boost_round=10000,
                            nfold=3, stratified=True, early_stopping_rounds=100,
                            seed=108, verbose_eval=False)
        
    # Extract the best score
    best_score = np.mean(xgb_cv_results['test-auc-mean'])
    return best_score


sampler = TPESampler(seed=108)
optuna_hpt = optuna.create_study(sampler=sampler,
                                 direction='maximize',
                                 study_name='optuna_hpt')

optuna_hpt.optimize(objective, n_trials=150)

While this gives me the CV metric (AUC here) and the best params, e.g.:

{'booster': 'gbtree',
 'alpha': 0.054159958811690126,
 'max_depth': 7,
 'scale_pos_weight': 16,
 'min_child_weight': 9,
 'eta': 0.0026002759893806117,
 'gamma': 0.0011140626171961645,
 'subsample': 0.667891200106278,
 'colsample_bytree': 0.6224726913934507,
 'grow_policy': 'lossguide'}

but this still doesn't tell me how many num_boost_round / n_estimators I need for training, since I use a large number together with early stopping.

Am I right in assuming that I will need to save the cross-validation (xgb.cv) results and get the n_estimators from there? And that my final model will be a retrained model with the best parameters and that num_boost_round included?

What am I missing?

Appreciate any help.

thanks!

Note to the questioner

If you are more comfortable with Stack Overflow, you may consider posting your question there instead. Alternatively, for issues that would benefit from more of an interactive session with the developers, you may refer to the optuna/optuna chat on Gitter.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (5 by maintainers)

Top GitHub Comments

5 reactions
sskarkhanis commented, Apr 27, 2020

Hello

Really appreciate the quick response, thank you. 😃 You are right about n_estimators: num_boost_round and n_estimators are aliases. Though in Optuna I could only use 'n_estimators' as the key in trial.set_user_attr() and not 'num_boost_round' (I got an error message).

Based on your suggestion, I've now modified the code to:

# Define Optuna Objective
def objective(trial):

    dtrain = xgb.DMatrix(df[features], label=df.target,
                         feature_names=features)

    param = {
             'silent': 1,
             'objective': 'binary:logistic',
             'eval_metric': "auc",
             'booster': trial.suggest_categorical('booster', ['gbtree']),
             'alpha': trial.suggest_loguniform('alpha', 1e-3, 1.0)
         }

    if param['booster'] == 'gbtree':
        param['max_depth'] = trial.suggest_int('max_depth', 1, 9)
        param['scale_pos_weight'] = trial.suggest_int('scale_pos_weight', 3, 75)
        param['min_child_weight'] = trial.suggest_int('min_child_weight', 1, 9)
        param['eta'] = trial.suggest_loguniform('eta', 1e-3, 1.0)
        param['gamma'] = trial.suggest_loguniform('gamma', 1e-3, 1.0)
        param['subsample'] = trial.suggest_loguniform('subsample', 0.6, 1.0)
        param['colsample_bytree'] = trial.suggest_loguniform('colsample_bytree', 0.6, 1.0)
        param['grow_policy'] = trial.suggest_categorical('grow_policy', ['depthwise', 'lossguide'])
   
    xgb_cv_results = xgb.cv(params=param, dtrain=dtrain, num_boost_round=10000,
                            nfold=3, stratified=True, early_stopping_rounds=100,
                            seed=108, verbose_eval=False)
    
    # (Optional)
    # Print n_estimators in the output at each call to the objective function
    print('-'*10, 'Trial {} has optimal trees: {}'.format(trial.number, xgb_cv_results.shape[0]), '-'*10)

    # (Optional)    
    # Save XGB results for Analysis; Update to your path by changing: file_path
    xgb_cv_results.to_csv(file_path + 'Optuna_cv_{}.csv'.format(trial.number), index=False)
    
    # Set n_estimators as a trial attribute; Accessible via study.trials_dataframe()
    trial.set_user_attr('n_estimators', len(xgb_cv_results))
    
    # Extract the best score
    best_score = xgb_cv_results.loc[xgb_cv_results.shape[0] - 1, 'test-auc-mean']

    return best_score

I’d be happy to add it as an example under https://github.com/optuna/optuna/tree/master/examples Or feel free to include it if you see fit.

4 reactions
kmedved commented, Apr 30, 2020

I think this would be a great example to add. Calculating n_estimators for a final model after early stopping is a very common task.
