pandas dtypes "boolean" not supported in classification target

Describe the bug

pandas extends NumPy’s dtypes, and not all of these extension dtypes are supported as targets in scikit-learn classifiers. In particular, if the target y has the pandas “boolean” dtype, a classifier such as LogisticRegression fails, whereas the same target with the NumPy “bool” dtype fits without error.

Steps/Code to Reproduce

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"boolean_col": pd.Series([False, True, False, True], dtype="boolean"),
                   "bool_col": pd.Series([False, True, False, True], dtype="bool"),
                   "num_col": pd.Series([1, 2, 3, 4])})

clf = LogisticRegression()

Expected Results

clf.fit(df[["num_col"]], df.bool_col)
--------------------------------------------------------------------------------------------------------------
LogisticRegression()

Actual Results

clf.fit(df[["num_col"]], df.boolean_col)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-9a0c5abf260e> in <module>
----> 1 clf.fit(df[["num_col"]], df.boolean_col)

~\AppData\Local\Continuum\anaconda3\envs\lognode\lib\site-packages\sklearn\linear_model\_logistic.py in fit(self, X, y, sample_weight)
   1343                                    order="C",
   1344                                    accept_large_sparse=solver != 'liblinear')
-> 1345         check_classification_targets(y)
   1346         self.classes_ = np.unique(y)
   1347 

~\AppData\Local\Continuum\anaconda3\envs\lognode\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
    170     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    171                       'multilabel-indicator', 'multilabel-sequences']:
--> 172         raise ValueError("Unknown label type: %r" % y_type)
    173 
    174 

ValueError: Unknown label type: 'unknown'

Versions

System:
    python: 3.8.2 (default, Apr 14 2020, 19:01:40) [MSC v.1916 64 bit (AMD64)]
    executable: ~\AppData\Local\Continuum\anaconda3\envs\lognode\python.exe
    machine: Windows-10-10.0.18362-SP0

Python dependencies:
    pip: 20.0.2
    setuptools: 46.1.3.post20200330
    sklearn: 0.23.1
    numpy: 1.18.1
    scipy: 1.4.1
    Cython: None
    pandas: 1.0.3
    matplotlib: 3.2.1
    joblib: 0.14.1
    threadpoolctl: 2.1.0

Built with OpenMP: True

Issue Analytics

Created 3 years ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

foxale commented, Jun 1, 2021

I can confirm it’s still present in 0.24.2

se-jaeger commented, Dec 1, 2021

Still failing with version ‘1.0.1’ when using precision_recall_curve with boolean data.

I debugged and found that something goes wrong in utils.multiclass.type_of_target. (See comment earlier.) There is still a y = np.asarray(y) call; is that the cause?

It’s because, in utils.multiclass.type_of_target, the cast y = np.asarray(y) converts the data to object dtype. But since the “boolean” dtype allows pd.NA and is experimental, I’m not sure how sklearn should deal with it.
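As a possible workaround until this is handled upstream, the target can be converted to a plain NumPy bool array before it is passed to fit. The sketch below is only an illustration: to_numpy_bool_target is a hypothetical helper name, not part of pandas or scikit-learn, and it assumes that refusing targets containing pd.NA is acceptable. The dtype printed for np.asarray(y) is reported as object in the comment above; it may differ on other pandas versions, so no expected output is shown.

```python
import numpy as np
import pandas as pd


def to_numpy_bool_target(y: pd.Series) -> np.ndarray:
    """Hypothetical helper: turn a nullable "boolean" Series into a NumPy bool array.

    Raises ValueError if the Series contains missing values, because a NumPy
    bool array has no way to represent pd.NA.
    """
    if y.isna().any():
        raise ValueError("target contains pd.NA; drop or impute missing labels first")
    return y.to_numpy(dtype=bool)


y_nullable = pd.Series([False, True, False, True], dtype="boolean")

# dtype seen by code that calls np.asarray(y) on the nullable Series
print(np.asarray(y_nullable).dtype)

# dtype after the explicit conversion: always NumPy bool
y_plain = to_numpy_bool_target(y_nullable)
print(y_plain.dtype)
```

With the DataFrame from the report, the call would then look like clf.fit(df[["num_col"]], to_numpy_bool_target(df.boolean_col)). Raising on pd.NA is a deliberate choice here: silently mapping missing labels to False would change the training data.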
