`DecisionTreeClassifier` became slower in v1.1 when fitting encoded variables
Describe the bug
Evaluating a pipeline that encodes categorical data takes around 8 times longer with v1.1 than with v1.0.2.
Steps/Code to Reproduce
import numpy as np
import pandas as pd
from time import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector
rng = np.random.RandomState(0)
n_samples, n_features = 50_000, 2
X = pd.DataFrame(rng.randn(n_samples, n_features))
X[2] = np.random.choice(
    ["male", "female", "other"], size=n_samples, p=[0.49, 0.49, 0.02]
)
X[3] = np.random.choice(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"],
    size=n_samples,
)
y = np.random.choice(
    [0, 1, 2], size=n_samples, p=[0.01, 0.49, 0.5]
)
preprocessor = make_column_transformer(
    (OrdinalEncoder(), make_column_selector(dtype_include=object)),
    remainder="passthrough"
)
X_transformed = preprocessor.fit_transform(X)
t0 = time()
DecisionTreeClassifier().fit(X_transformed, y)
duration = time() - t0
duration
Expected Results
~450ms
Actual Results
3s
Versions
System:
    python: 3.9.5 | packaged by conda-forge | (default, Jun 19 2021, 00:32:32) [GCC 9.3.0]
    executable: /home/arturoamor/miniforge3/envs/scikit-learn-course/bin/python
    machine: Linux-5.14.0-1036-oem-x86_64-with-glibc2.31
Python dependencies:
    sklearn: 1.1.0
    pip: 21.1.3
    setuptools: 49.6.0.post20210108
    numpy: 1.21.0
    scipy: 1.7.0
    Cython: None
    pandas: 1.3.0
    matplotlib: 3.4.2
    joblib: 1.0.1
    threadpoolctl: 2.1.0
Built with OpenMP: True
threadpoolctl info:
    filepath: /home/arturoamor/miniforge3/envs/scikit-learn-course/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
    prefix: libgomp
    user_api: openmp
    internal_api: openmp
    version: None
    num_threads: 8

    filepath: /home/arturoamor/miniforge3/envs/scikit-learn-course/lib/libopenblasp-r0.3.15.so
    prefix: libopenblas
    user_api: blas
    internal_api: openblas
    version: 0.3.15
    num_threads: 8
    threading_layer: pthreads
Top GitHub Comments
Wild-guessing a bit more (not a sorting algorithm expert either), and looking at https://github.com/scikit-learn/scikit-learn/pull/22868/files#diff-e2cca285e1e883ab1d427120dfa974c1ba83eb6e2f5d5f416bbd99717ca5f5fcL490-L491 which says
Maybe, compared to the previous implementation, our `simultaneous_sort` is missing the 3-way partition function. I am guessing this is explained in more detail here.
Fixed in #23410
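For readers unfamiliar with the term used in the comment above: a 3-way partition splits the array into values less than, equal to, and greater than the pivot, so runs of duplicates are set aside in one pass instead of being partitioned again and again. That matters for ordinal-encoded columns like the ones in the reproducer, which hold only 3 or 12 distinct values. Below is a minimal sketch in plain Python. It is not scikit-learn's Cython code: the name `quicksort_3way` is made up for illustration, and unlike `simultaneous_sort` it sorts a single list rather than feature values and sample indices together.

```python
import random


def quicksort_3way(values):
    """Sort a list in place with quicksort using a 3-way partition.

    Illustrative only; not scikit-learn's implementation.
    """

    def sort(lo, hi):
        while lo < hi:
            pivot = values[random.randint(lo, hi)]
            # Invariant during the scan:
            #   values[lo:lt]   <  pivot
            #   values[lt:i]    == pivot
            #   values[gt+1:hi+1] > pivot
            lt, i, gt = lo, lo, hi
            while i <= gt:
                if values[i] < pivot:
                    values[lt], values[i] = values[i], values[lt]
                    lt += 1
                    i += 1
                elif values[i] > pivot:
                    values[i], values[gt] = values[gt], values[i]
                    gt -= 1
                else:
                    i += 1
            # Everything in values[lt:gt+1] equals the pivot and is final.
            # Recurse into the smaller side and loop on the larger one
            # to keep the recursion depth logarithmic.
            if lt - lo < hi - gt:
                sort(lo, lt - 1)
                lo = gt + 1
            else:
                sort(gt + 1, hi)
                hi = lt - 1

    sort(0, len(values) - 1)
    return values
```

With only a few distinct values, each value is chosen as a pivot at most once, after which all of its copies are in their final position. For instance, `quicksort_3way([2, 0, 1, 0, 2, 1, 1])` returns `[0, 0, 1, 1, 1, 2, 2]`.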