
Using a RandomForest's `warm_start` together with `random_state` is poorly documented

Describe the issue linked to the documentation

Consider the following example:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

x, y = make_classification(random_state=0)

# Fit one tree, then grow the forest by one more tree in a second fit.
rf = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
rf.fit(x, y)
rf.n_estimators += 1
rf.fit(x, y)

According to the Controlling randomness section of the documentation, when random_state is set:

If an integer is passed, calling fit or split multiple times always yields the same results.

But calling fit multiple times in a warm start setting does not yield the same results (as expected: we want more trees, and we want different trees). The example above produces a forest with two distinct trees, and the overall forest is identical to one created in a single call with RandomForestClassifier(n_estimators=2, warm_start=False, random_state=0). The same behavior is observed when a numpy.random.RandomState instance is used.
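
To make that equivalence concrete, here is a small check built on the example above (a sketch; the textual tree dumps from sklearn.tree.export_text are compared as a proxy for tree-by-tree equality):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

x, y = make_classification(random_state=0)

# Forest grown incrementally across two warm-started fits.
rf_warm = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
rf_warm.fit(x, y)
rf_warm.n_estimators += 1
rf_warm.fit(x, y)

# Same-sized forest built in a single fit.
rf_cold = RandomForestClassifier(n_estimators=2, warm_start=False, random_state=0)
rf_cold.fit(x, y)

# The trees match pairwise, so the two forests are identical.
for t_warm, t_cold in zip(rf_warm.estimators_, rf_cold.estimators_):
    assert export_text(t_warm) == export_text(t_cold)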

However, at first I found it impossible to determine this behavior from the documentation alone. As far as I am aware, the only hint that could have helped me is this sentence from the warm_start documentation:

When warm_start is true, the existing fitted model attributes are used to initialize the new model in a subsequent call to fit.

In hindsight, the internal random state object likely counts as a “fitted model attribute”, which would let you infer this behavior from the documentation.
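
As a rough mental model (a conceptual sketch only, not the actual forest internals), think of a shared numpy.random.RandomState that advances every time a per-tree seed is drawn from it, so a tree added on a warm-started fit gets a fresh seed:

import numpy as np

# A shared RandomState advances each time numbers are drawn from it,
# so the seed drawn for a tree added later differs from the seed
# drawn for the first tree.
rng = np.random.RandomState(0)
seed_first_fit = rng.randint(np.iinfo(np.int32).max)
seed_second_fit = rng.randint(np.iinfo(np.int32).max)
print(seed_first_fit != seed_second_fit)  # True: the new tree differs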

Suggest a potential alternative/fix

I am not sure whether this behavior is consistent across all estimators that support the warm_start parameter. A clarification in the warm_start section makes the most sense to me: either a single sentence or a small paragraph, depending on whether the estimators differ in behavior.


I’d be willing to set up the PR but I figure it makes sense to agree on the action (if any) and wording first.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
glemaitre commented, Dec 21, 2021

Should the clarification be added to the general warm_start documentation, noting that it only holds for ensembles? That could be confusing if non-ensemble methods behave similarly. Alternatively, should the clarification be copied into the docstring of each (user-facing) ensemble class? Or is it better to wait until someone with more time (myself included, in a few months) can figure out the exact behavior across all submodules?

I would start with the tree-based models in the ensemble module, and I would prefer a description tailored to this type of model. It might be easier to understand than a rather general explanation that would have to fit every model with a warm_start parameter.

0 reactions
glemaitre commented, Dec 21, 2021

how non-ensemble estimators behave.

In linear models, it just means that the optimization starts from the previously fitted weights instead of a random initialization.
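
For example (a minimal sketch using SGDClassifier, whose warm_start is documented as reusing the previous solution as initialization):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

x, y = make_classification(random_state=0)

# With warm_start=True, a second call to fit resumes from the
# previously learned coefficients instead of re-initializing them.
clf = SGDClassifier(max_iter=5, tol=None, warm_start=True, random_state=0)
clf.fit(x, y)
w_first = clf.coef_.copy()
clf.fit(x, y)
w_second = clf.coef_.copy()
print(np.allclose(w_first, w_second))  # typically False: training continued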
