Feature request: Random Forest node splitting sample size
During the development of a model, the faster one can iterate, the better. That means getting your training time down to a manageable level, albeit at the cost of some accuracy. With large training sets, training time for random forests can be an impediment to rapid iteration. @jph00 showed me this awesome trick that convinces scikit-learn's RF implementation to use a subsample of the full training data set when selecting split variables and values at each node:
from sklearn.ensemble import forest  # module path used by scikit-learn < 0.22

def set_RF_sample_size(n):
    # Patch the helper that draws each tree's bootstrap sample so it draws only n rows.
    forest._generate_sample_indices = \
        (lambda rs, n_samples: forest.check_random_state(rs).randint(0, n_samples, n))

def reset_RF_sample_size():
    # Restore the default behaviour: bootstrap samples as large as the training set.
    forest._generate_sample_indices = \
        (lambda rs, n_samples: forest.check_random_state(rs).randint(0, n_samples, n_samples))
We both find that splitting using, say, 20,000 samples rather than the full 400,000 gives nearly the same results with much faster training time. Naturally it takes a bit of experimentation on each data set to find a suitable subsample size, but in general we find that feature engineering and other preparation work is totally fine on a subsample. We then reset the sample size for the final model.
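To make the workflow concrete, here is a rough usage sketch. The synthetic X/y and the specific estimator settings are illustrative only, not from the original issue, and it assumes the two functions above have been defined in the same session:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in data; in the real use case X, y would be the 400,000-row training set.
X = np.random.rand(400_000, 10)
y = np.random.rand(400_000)

set_RF_sample_size(20000)   # each tree now bootstraps only 20,000 rows
RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(X, y)    # fast fits while iterating

reset_RF_sample_size()      # restore full-size bootstrap samples
RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(X, y)   # final model on all samples

Note that the patch has to be in place before calling fit, since the sampling helper is invoked during tree building.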
We appreciate your consideration of this feature request!
Top GitHub Comments
@cmarmo Ok, I labeled it moderate, so it is not picked up by beginners at sprints.
@reshamas, checking the pull requests that tried to solve it, I'm under the impression that this issue does not have a straightforward solution. Perhaps it will require some experience to take over.