pandas dtypes "boolean" not supported in classification target

Describe the bug

pandas has extended NumPy’s dtypes, and not all of these extension dtypes are supported as targets in a sklearn classifier. In particular, if the target y has the pandas “boolean” dtype, a classifier such as LogisticRegression fails, whereas if the target has the NumPy “bool” dtype, the classifier fits without error.

Steps/Code to Reproduce

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"boolean_col": pd.Series([False, True, False, True], dtype="boolean"),
                   "bool_col": pd.Series([False, True, False, True], dtype="bool"),
                   "num_col": pd.Series([1, 2, 3, 4])})

clf = LogisticRegression()

Expected Results

clf.fit(df[["num_col"]], df.bool_col)
--------------------------------------------------------------------------------------------------------------
LogisticRegression()

Actual Results

clf.fit(df[["num_col"]], df.boolean_col)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-9a0c5abf260e> in <module>
----> 1 clf.fit(df[["num_col"]], df.boolean_col)

~\AppData\Local\Continuum\anaconda3\envs\lognode\lib\site-packages\sklearn\linear_model\_logistic.py in fit(self, X, y, sample_weight)
   1343                                    order="C",
   1344                                    accept_large_sparse=solver != 'liblinear')
-> 1345         check_classification_targets(y)
   1346         self.classes_ = np.unique(y)
   1347 

~\AppData\Local\Continuum\anaconda3\envs\lognode\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
    170     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    171                       'multilabel-indicator', 'multilabel-sequences']:
--> 172         raise ValueError("Unknown label type: %r" % y_type)
    173 
    174 

ValueError: Unknown label type: 'unknown'

Versions

System:
    python: 3.8.2 (default, Apr 14 2020, 19:01:40) [MSC v.1916 64 bit (AMD64)]
    executable: ~\AppData\Local\Continuum\anaconda3\envs\lognode\python.exe
    machine: Windows-10-10.0.18362-SP0

Python dependencies:
    pip: 20.0.2
    setuptools: 46.1.3.post20200330
    sklearn: 0.23.1
    numpy: 1.18.1
    scipy: 1.4.1
    Cython: None
    pandas: 1.0.3
    matplotlib: 3.2.1
    joblib: 0.14.1
    threadpoolctl: 2.1.0

Built with OpenMP: True

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
foxale commented, Jun 1, 2021

I can confirm it’s still present in 0.24.2

0 reactions
se-jaeger commented, Dec 1, 2021

Still failing with version 1.0.1 when using precision_recall_curve with boolean data.

I debugged and found that something goes wrong in utils.multiclass.type_of_target. (See the earlier comment.) There is still a y = np.asarray(y) call; is that the cause?

It’s because, in utils.multiclass.type_of_target, the cast y = np.asarray(y) converts the data to object dtype. But since the “boolean” dtype allows pd.NA and is experimental, I’m not sure how sklearn should deal with it.
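Until the nullable dtype is handled upstream, one possible workaround is to convert the target to a plain NumPy bool array before passing it to fit. The sketch below is only an illustration: to_numpy_bool is a hypothetical helper, not part of pandas or sklearn, and it assumes that a target containing pd.NA should be rejected rather than silently coerced.

```python
import numpy as np
import pandas as pd


def to_numpy_bool(y: pd.Series) -> np.ndarray:
    """Convert a nullable "boolean" Series to a plain NumPy bool array.

    Hypothetical helper for illustration; not part of pandas or sklearn.
    """
    # pd.NA has no bool equivalent, so refuse to guess a value for it.
    if y.isna().any():
        raise ValueError("target contains missing values; drop or impute them first")
    return y.astype("bool").to_numpy()


y_nullable = pd.Series([False, True, False, True], dtype="boolean")
y_plain = to_numpy_bool(y_nullable)

# The comment above reports object dtype for this cast; the result may
# differ on other pandas versions, so it is printed rather than asserted.
print(np.asarray(y_nullable).dtype)
print(y_plain.dtype)
```

With the data from the report, this would be used as clf.fit(df[["num_col"]], to_numpy_bool(df.boolean_col)). Raising on missing values is a design choice for this sketch; dropping or imputing the affected rows first would be equally reasonable.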
