Using a RandomForest's `warm_start` together with `random_state` is poorly documented
Describe the issue linked to the documentation
Consider the following example:
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

x, y = make_classification(random_state=0)

rf = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
rf.fit(x, y)
rf.n_estimators += 1
rf.fit(x, y)
```
According to the "Controlling randomness" documentation, when random_state is set:

> If an integer is passed, calling fit or split multiple times always yields the same results.

But calling fit multiple times in a warm-start setting does not yield the same results (as expected: we want more trees, and we want different trees). The example above produces a forest with two distinct trees, and the overall forest is identical to one created in a single call with RandomForestClassifier(n_estimators=2, warm_start=False, random_state=0). The same behavior is observed when a numpy.random.RandomState instance is used.
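The equivalence described above can be checked directly. This is a minimal sketch that compares the two forests tree by tree via their text representations (using `sklearn.tree.export_text`); it assumes, as observed in this issue, that the incrementally grown forest matches the one built in a single call:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X, y = make_classification(random_state=0)

# Grow a forest one tree at a time with warm_start.
rf_warm = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
rf_warm.fit(X, y)
rf_warm.n_estimators += 1
rf_warm.fit(X, y)

# Build a forest of the same size in a single call.
rf_cold = RandomForestClassifier(n_estimators=2, warm_start=False, random_state=0)
rf_cold.fit(X, y)

# Compare the fitted trees pairwise by their printed structure.
same = all(
    export_text(a) == export_text(b)
    for a, b in zip(rf_warm.estimators_, rf_cold.estimators_)
)
print(same)
```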
However, I found it (at first) impossible to determine this behavior from the documentation alone. As far as I am aware, the only hint that could have helped me is this warm_start documentation:

> When warm_start is true, the existing fitted model attributes are used to initialize the new model in a subsequent call to fit.

In hindsight, the internal random-state object likely counts as a "fitted model attribute", which would allow you to infer the behavior from the documentation.
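One way to see the state that carries over: each tree in a fitted forest stores its own integer seed, drawn from the forest's RNG at fit time, on the fitted `estimators_` attribute. This is a small sketch (assuming the current scikit-learn behavior of seeding sub-estimators via `random_state`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)

rf = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
rf.fit(X, y)
rf.n_estimators += 1
rf.fit(X, y)

# Each fitted tree carries its own integer seed; the second fit drew a
# fresh seed rather than reusing the first tree's, so the trees differ.
seeds = [tree.random_state for tree in rf.estimators_]
print(seeds)
```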
Suggest a potential alternative/fix
I am not sure whether this behavior is consistent across all estimators that support the warm_start parameter. A clarification in the warm_start section makes the most sense to me: either a single sentence or a small paragraph, depending on whether there are differences between the estimators. I'd be willing to set up the PR, but I figure it makes sense to agree on the action (if any) and the wording first.
Issue Analytics
- State:
- Created: 2 years ago
- Comments: 8 (8 by maintainers)
Top GitHub Comments
I would start with the tree-based models in the ensemble module, and I would prefer a description that is specific to this type of model. It might be easier to understand than a rather general explanation that would fit all models with a warm_start. In linear models, warm_start just means that the optimization starts from the previously fitted weights instead of random weights.
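The linear-model case the comment describes can be sketched as follows. This example uses `SGDClassifier` (my choice for illustration; the comment does not name a specific estimator): with warm_start=True, the second fit resumes the optimization from the coefficients left by the first fit instead of re-initializing them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(random_state=0)

sgd = SGDClassifier(max_iter=5, tol=None, warm_start=True, random_state=0)
sgd.fit(X, y)
coef_first = sgd.coef_.copy()

# With warm_start=True, this call continues training from coef_first
# rather than starting the optimization from scratch.
sgd.fit(X, y)
coef_second = sgd.coef_.copy()

print(coef_first.shape, coef_second.shape)
```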