Using warmstart in GradientBoostingClassifier produces inferior solution
Description
Using the warm_start flag of GradientBoostingClassifier results in an inferior solution compared to fitting all base models at once. From how the docs describe warm starting, I expect the two results to be equal.
Steps/Code to Reproduce
```python
import numpy as np
import sklearn.datasets
import sklearn.ensemble
import sklearn.model_selection

X, y = sklearn.datasets.load_digits(return_X_y=True)

# Subsample 150 digits so the effect is easy to reproduce.
rs = np.random.RandomState(42)
indices = np.arange(X.shape[0])
rs.shuffle(indices)
indices = indices[:150]
X = X[indices]
y = y[indices]

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=16,
)

# Warm-started classifier: grow the ensemble one estimator at a time.
classifier = sklearn.ensemble.GradientBoostingClassifier(
    warm_start=True, n_estimators=1, random_state=1,
)
classifier.fit(X_train, y_train)
for i in range(99):
    classifier.n_estimators += 1
    classifier.fit(X_train, y_train)
print(classifier.score(X_test, y_test))

# Reference classifier: fit all 100 estimators in a single call.
classifier = sklearn.ensemble.GradientBoostingClassifier(
    n_estimators=100, random_state=1,
)
classifier.fit(X_train, y_train)
print(classifier.score(X_test, y_test))
```
Expected Results
Two equal scores.
Actual Results
0.605263157895 (warm-started fit)
0.657894736842 (single fit)
Versions
Linux-4.4.0-96-generic-x86_64-with-debian-stretch-sid
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.13.3
SciPy 0.19.1
Scikit-Learn 0.19.1
Edit
The following models have the same issue:
- ExtraTreesClassifier
- SGDClassifier
- PassiveAggressiveClassifier
RandomForestClassifier does not seem to have this problem. I did not test the regression counterparts, but I expect them to behave the same way.
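The RandomForestClassifier claim can be checked with a minimal sketch like the one below (not from the issue; it assumes scikit-learn's documented warm_start semantics, under which a warm-started forest should reproduce the ensemble you get from a single fit with the same random_state):

```python
import numpy as np
import sklearn.datasets
import sklearn.ensemble

X, y = sklearn.datasets.load_digits(return_X_y=True)

# Grow a forest one tree at a time with warm_start.
warm = sklearn.ensemble.RandomForestClassifier(
    warm_start=True, n_estimators=1, random_state=1,
)
warm.fit(X, y)
for _ in range(9):
    warm.n_estimators += 1
    warm.fit(X, y)

# Fit all 10 trees in a single call.
cold = sklearn.ensemble.RandomForestClassifier(
    n_estimators=10, random_state=1,
)
cold.fit(X, y)

# If warm starting is consistent, both forests predict identically.
same = np.array_equal(warm.predict(X), cold.predict(X))
print(same)
```

The same pattern, swapped to GradientBoostingClassifier, is what fails in the report above.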
Issue Analytics
- Created: 6 years ago
- Comments: 12 (12 by maintainers)
Top GitHub Comments
Let’s just hope that all issues will be like #10000 from now on: a fully stand-alone snippet to reproduce the problem, closed in under a day 😉 !
#7071 (early stopping for Gradient boosting) was probably too big a change for a minor release. Given how you pushed 0.19.1 forward and made it happen, I really don’t think there is anything you should be sorry about.