Numerical columns treated as categorical


Issue Description

Hi guys,

I heard of PPS through your article and was curious to test it, so I tried it on some data I’ve been working on.

Unfortunately, I get numerous warnings when calculating the PPS matrix:

Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=4.

UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.

My guess is that pps treats my data as categorical and therefore tries to apply classification with a huge number of labels.

Looking at how pps determines whether the data is numerical or categorical, I cannot find a reason why it would consider my data categorical:

  • The dtypes are int or float
  • The number of unique values is higher than 15 (except for 1 column which is equal to 15, but changing the NUMERIC_AS_CATEGORIC_BREAKPOINT constant to 10 does not resolve the problem)
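The two checks listed above can be sketched in plain Python. This is only an illustration of the heuristic as the issue describes it, not ppscore's actual implementation: `looks_numeric` is a hypothetical helper, and treating a column with exactly 15 unique values as categorical is an assumption.

```python
import numbers

# Threshold quoted in the issue; the name mirrors the ppscore constant.
NUMERIC_AS_CATEGORIC_BREAKPOINT = 15


def looks_numeric(values, breakpoint=NUMERIC_AS_CATEGORIC_BREAKPOINT):
    """Hypothetical helper: True if every value is an int or float and
    there are strictly more than `breakpoint` distinct values."""
    all_numbers = all(
        isinstance(v, numbers.Real) and not isinstance(v, bool) for v in values
    )
    # Assumption: a column with exactly `breakpoint` distinct values
    # still counts as categorical.
    return all_numbers and len(set(values)) > breakpoint


year_built = list(range(1900, 1950))  # 50 distinct integers
floors = [1, 2, 3] * 10               # only 3 distinct integers

print(looks_numeric(year_built))
print(looks_numeric(floors))
```

With a DataFrame, the same check could be run per column, for example on `df[col].tolist()`.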

Also, if I try to force the pps score to be calculated using task = 'regression', I get the following error:

'DataFrame' object has no attribute 'dtype'

Here is my code:

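import pandas as pd
import ppscore as pps

df = pd.read_csv('seattle_building_energy_benchmark.csv', sep=';')

# Inspect the column dtypes and the number of unique values per column
df.dtypes

df.nunique()

# Try a lower numeric-as-categoric breakpoint
pps.NUMERIC_AS_CATEGORIC_BREAKPOINT = 10

# Let ppscore infer the task for each target column
for col in df.columns:
    print(col)
    pps.score(df, x='YearBuilt', y=col, task=None)

# Force regression for each target column
for col in df.columns:
    print(col)
    pps.score(df, x='YearBuilt', y=col, task='regression')

# Compute the full PPS matrix
pps.matrix(df)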

Is there something I am missing? If not, would you like me to share the data with you? (I do not know which sharing method is most convenient for you.)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
8080labs commented, May 20, 2020

Yes, the data was very helpful - thank you for that!

1 reaction
alexandersmedley commented, May 20, 2020

Hi Florian,

I’m happy to learn the data helped you identify the problems 😃

I had a hunch that the categorical breakpoint might not work, but couldn’t be sure because the for loop was acting weird. I didn’t anticipate the x = y exception!

Thanks again for providing this package and taking the time to update and support it.

Cheers,

Alexander
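
One way to avoid the x = y case mentioned in the comment above, assuming it arises because the loop in the issue also scores YearBuilt against itself, is to drop the feature column from the list of targets. A minimal sketch; `target_columns` and every column name other than YearBuilt are hypothetical:

```python
def target_columns(columns, x):
    """Return every column name except the feature column `x` itself."""
    return [col for col in columns if col != x]


# Hypothetical column names, apart from YearBuilt, which appears in the issue.
columns = ["YearBuilt", "NumberofFloors", "SiteEnergyUse"]

targets = target_columns(columns, x="YearBuilt")
print(targets)
```

In the loop from the issue, this would mean iterating over `target_columns(df.columns, 'YearBuilt')` instead of `df.columns` before calling `pps.score`.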


