`DecisionTreeClassifier` became slower in v1.1 when fitting encoded variables

Describe the bug

Evaluating a pipeline that encodes categorical data takes around 8 times longer with v1.1 than with v1.0.2.

Steps/Code to Reproduce

import numpy as np
import pandas as pd
from time import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector

# Seeded generator so the synthetic data is reproducible across runs.
rng = np.random.RandomState(0)
n_samples, n_features = 50_000, 2
X = pd.DataFrame(rng.randn(n_samples, n_features))
# Two low-cardinality categorical columns (3 and 12 distinct values).
X[2] = rng.choice(
    ["male", "female", "other"], size=n_samples, p=[0.49, 0.49, 0.02]
)
X[3] = rng.choice(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"],
    size=n_samples,
)
y = rng.choice(
    [0, 1, 2], size=n_samples, p=[0.01, 0.49, 0.5]
)

# Ordinal-encode the string columns; pass the numeric columns through unchanged.
preprocessor = make_column_transformer(
    (OrdinalEncoder(), make_column_selector(dtype_include=object)),
    remainder="passthrough"
)
X_transformed = preprocessor.fit_transform(X)

# Time a single fit on the encoded data.
t0 = time()
DecisionTreeClassifier().fit(X_transformed, y)
duration = time() - t0
print(f"fit took {duration:.2f} s")

Expected Results

~450ms

Actual Results

3s
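
A single wall-clock measurement like the one in the reproducer can be noisy. The following is a minimal sketch of a repeat-and-summarize timing helper, not part of the original report. The names `time_call` and `slowdown` are hypothetical, and the workload is a stand-in so the snippet runs without scikit-learn installed. The last line computes the ratio implied by the figures reported above (roughly 450 ms expected against roughly 3 s observed).

```python
from statistics import median
from time import perf_counter


def time_call(func, repeats=5):
    """Call func() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        t0 = perf_counter()
        func()
        durations.append(perf_counter() - t0)
    return durations


def slowdown(baseline_seconds, observed_seconds):
    """Return how many times slower the observed run is than the baseline."""
    return observed_seconds / baseline_seconds


# Stand-in workload (an assumption for illustration): sorting a reversed range.
durations = time_call(lambda: sorted(range(100_000, 0, -1)), repeats=5)
print(f"min={min(durations):.4f}s median={median(durations):.4f}s")

# Ratio implied by the numbers reported in this issue.
print(f"reported slowdown: {slowdown(0.45, 3.0):.1f}x")
```

To time the reproducer itself, one would pass `lambda: DecisionTreeClassifier().fit(X_transformed, y)` as `func` and compare the minimum or median across the two scikit-learn versions.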

Versions

System:
    python: 3.9.5 | packaged by conda-forge | (default, Jun 19 2021, 00:32:32)  [GCC 9.3.0]
executable: /home/arturoamor/miniforge3/envs/scikit-learn-course/bin/python
   machine: Linux-5.14.0-1036-oem-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.1.0
          pip: 21.1.3
   setuptools: 49.6.0.post20210108
        numpy: 1.21.0
        scipy: 1.7.0
       Cython: None
       pandas: 1.3.0
   matplotlib: 3.4.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /home/arturoamor/miniforge3/envs/scikit-learn-course/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
         prefix: libgomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8

       filepath: /home/arturoamor/miniforge3/envs/scikit-learn-course/lib/libopenblasp-r0.3.15.so
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: 0.3.15
    num_threads: 8
threading_layer: pthreads

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

lesteve commented, May 17, 2022 (1 reaction)

Wild-guessing a bit more (not a sorting algorithm expert either), and looking at https://github.com/scikit-learn/scikit-learn/pull/22868/files#diff-e2cca285e1e883ab1d427120dfa974c1ba83eb6e2f5d5f416bbd99717ca5f5fcL490-L491 which says

# Introsort with median of 3 pivot selection and 3-way partition function
# (robust to repeated elements, e.g. lots of zero features).

Maybe, compared to the previous implementation, our simultaneous_sort is missing the 3-way partition function. I am guessing this is explained in more detail here
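
To illustrate why a 3-way partition matters for features with many repeated values, here is a minimal pure-Python sketch. It is not scikit-learn's Cython code; the function names `sort_2way` and `sort_3way` are hypothetical, and both use a plain middle-element pivot rather than introsort with median-of-3. The sketch counts one comparison per element examined, on data with only three distinct values, similar to an ordinal-encoded categorical column.

```python
import random


def sort_2way(values):
    """Quicksort with a 2-way (Lomuto) partition; returns (sorted_list, comparisons)."""
    a = list(values)
    comparisons = 0
    stack = [(0, len(a) - 1)]  # explicit stack avoids deep recursion
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        mid = (lo + hi) // 2
        a[mid], a[hi] = a[hi], a[mid]  # move the pivot to the end
        pivot = a[hi]
        i = lo
        for j in range(lo, hi):
            comparisons += 1
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        # Elements equal to the pivot stay in the right part and get re-sorted.
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return a, comparisons


def sort_3way(values):
    """Quicksort with a 3-way partition; returns (sorted_list, comparisons)."""
    a = list(values)
    comparisons = 0
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = a[(lo + hi) // 2]
        lt, i, gt = lo, lo, hi
        while i <= gt:
            comparisons += 1  # one three-way comparison per element examined
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        # All elements equal to the pivot are final and excluded from both parts.
        stack.append((lo, lt - 1))
        stack.append((gt + 1, hi))
    return a, comparisons


# Low-cardinality data: 2000 values drawn from only three categories.
data = random.Random(0).choices([0, 1, 2], k=2000)
sorted_2, count_2 = sort_2way(data)
sorted_3, count_3 = sort_3way(data)
print(f"2-way comparisons: {count_2}, 3-way comparisons: {count_3}")
```

In this sketch the 2-way variant re-partitions runs of equal elements one pivot at a time, so its cost grows quadratically in the size of each run, while the 3-way variant removes every copy of the pivot value in a single pass.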

glemaitre commented, May 19, 2022

Fixed in #23410
