RandomForestClassifier parallel issues with CPU usage decreasing over run

Description

Related or identical to issue #6023, which is closed but does not appear to be fixed as of 0.19.2. I hit it not with GridSearchCV but with RFE wrapping a RandomForestClassifier. The behavior is exactly the same: parallel CPU usage starts at 100%, as it should, then steadily decreases to low numbers, while system CPU usage (as shown in top on Linux) climbs to 10-15% per core, which is not normal. The fit also never finishes (or takes far too long if it ever does).

Steps/Code to Reproduce

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=3200, n_informative=100, n_redundant=3100, n_classes=2, n_clusters_per_class=30)

pipe = Pipeline([
    ('slr', StandardScaler()),
    ('fs', RFE(RandomForestClassifier(n_estimators=1000, max_features='auto', class_weight='balanced', n_jobs=-1), step=0.01, n_features_to_select=10))
])
pipe.fit(X, y)
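
For a sense of scale, here is a rough sketch of how many forest fits the pipeline above implies. It assumes RFE turns a fractional `step` into `int(max(1, step * n_features))` features removed per iteration and refits the estimator once more on the final feature set, which is my understanding of the scikit-learn implementation. The helper name `rfe_fit_count` is made up for illustration.

```python
import math


def rfe_fit_count(n_features, n_features_to_select, step):
    """Estimate (features removed per iteration, total estimator fits) for RFE.

    Assumption: a fractional step is converted once, up front, relative to
    the original feature count, and one extra fit happens on the final subset.
    """
    if 0.0 < step < 1.0:
        step = int(max(1, step * n_features))
    else:
        step = int(step)
    to_remove = max(0, n_features - n_features_to_select)
    eliminations = math.ceil(to_remove / step)
    return step, eliminations + 1  # +1 for the final fit on the kept features


if __name__ == "__main__":
    per_iter, fits = rfe_fit_count(3200, 10, 0.01)
    n_estimators = 1000  # from the reproduction above
    print(f"{per_iter} features removed per iteration")
    print(f"{fits} forest fits, {fits * n_estimators} trees in total")
```

Under those assumptions, a long run time is expected even without a parallelism problem, so it helps to separate "slow" from "CPU usage collapsing" when timing the reproduction.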

Expected Results

Parallel CPU usage stays at effectively 100% on n_jobs cores for each iteration of RFE, and the pipeline fit completes in a normal amount of time.

Actual Results

Parallel CPU usage starts at 100%, as it should, then steadily decreases to low numbers, while system CPU usage (as shown in top on Linux) climbs to 10-15% per core, which is not normal. The pipeline fit never finishes.

Versions

Linux-4.18.16-200.fc28.x86_64-x86_64-with-fedora-28-Twenty_Eight
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
NumPy 1.14.3
SciPy 1.1.0
Scikit-Learn 0.19.2

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

1 reaction
amueller commented, Nov 13, 2018

@hermidalc No, because the bottleneck is communication between the cores.
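
One way to check whether parallel overhead dominates on a given machine is to time the same forest fit with `n_jobs=1` and `n_jobs=-1` as the number of features shrinks, which is what happens across RFE iterations. This is only a diagnostic sketch: the helper names (`time_fit`, `compare_n_jobs`) and the dataset sizes are illustrative and not from the issue, and it does not attempt to reproduce the hang itself.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def time_fit(X, y, n_jobs, n_estimators=50):
    """Return wall-clock seconds for one forest fit with the given n_jobs."""
    clf = RandomForestClassifier(
        n_estimators=n_estimators, n_jobs=n_jobs, random_state=0
    )
    start = time.perf_counter()
    clf.fit(X, y)
    return time.perf_counter() - start


def compare_n_jobs(feature_counts=(200, 50, 10), n_samples=300):
    """Time serial vs. parallel fits on progressively narrower feature sets."""
    X, y = make_classification(
        n_samples=n_samples,
        n_features=max(feature_counts),
        n_informative=10,
        n_redundant=0,
        random_state=0,
    )
    timings = {}
    for k in feature_counts:
        X_k = X[:, :k]  # keep only the first k columns
        timings[k] = {
            "serial": time_fit(X_k, y, n_jobs=1),
            "parallel": time_fit(X_k, y, n_jobs=-1),
        }
    return timings


if __name__ == "__main__":
    for k, t in compare_n_jobs().items():
        print(f"{k:4d} features: serial {t['serial']:.3f}s, "
              f"parallel {t['parallel']:.3f}s")
```

If the parallel timing stops improving on the serial one as the feature count drops, that would be consistent with coordination overhead outweighing the per-tree work, though the numbers depend heavily on the machine and the scikit-learn and joblib versions.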

0 reactions
thomasjpfan commented, Apr 22, 2022