Can not handle Categorical in FAMD

I am using the latest versions of pandas and Prince. When I run the following example, it fails:

    import pandas as pd
    import prince

    df = pd.DataFrame({
        'variable_1': [4, 5, 6, 7, 11, 2, 52],
        'variable_2': [10, 20, 30, 40, 10, 74, 10],
        'variable_3': [100, 50, 30, 50, 19, 29, 20],
        'color': ['red', 'blue', 'green', 'blue', 'red', 'red', 'blue'],
    })

    # Converting the column to the category dtype is what triggers the error
    df['color'] = df['color'].astype('category')

    model = prince.FAMD(
        n_components=2,
        copy=True,
        check_input=True,
        engine='auto',
        random_state=1,
    )
    model.fit(df)

ValueError: Not all columns in "Categorical" group are of the same type

I have also looked into why this happens. When FAMD calls the fit method of MFA, it checks whether each group of columns is numeric or categorical with the following code:

    for name, cols in sorted(self.groups.items()):
        all_num = all(pd.api.types.is_numeric_dtype(X[c]) for c in cols)
        all_cat = all(pd.api.types.is_string_dtype(X[c]) for c in cols)
        if not (all_num or all_cat):
            raise ValueError('Not all columns in "{}" group are of the same type'.format(name))

This was fine with earlier versions of pandas. Now, however, the check all_cat = all(pd.api.types.is_string_dtype(X[c]) for c in cols) no longer works for Categorical data; it only works for object data.

So this part probably needs to be changed to the following:

    for name, cols in sorted(self.groups.items()):
        all_num = all(pd.api.types.is_numeric_dtype(X[c]) for c in cols)
        all_obj = all(pd.api.types.is_string_dtype(X[c]) for c in cols)
        all_cat = all(pd.api.types.is_categorical_dtype(X[c]) for c in cols)
        if not (all_num or all_obj or all_cat):
            raise ValueError('Not all columns in "{}" group are of the same type'.format(name))

Am I right?

Issue Analytics

  • State: open
  • Created 3 years ago
  • Reactions: 3
  • Comments: 5

Top GitHub Comments

3 reactions
xosxos commented, Jan 1, 2022

Same issue here with pandas 1.2.4 and prince 0.7.1

If anyone is wondering, a temporary fix is to switch the column from dtype category to dtype object:

df['color'] = df['color'].astype('object')
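When a frame has several categorical columns, the same workaround can be applied to all of them at once. A minimal sketch, assuming the goal is simply to hand object columns to the model; the small frame and the name df_fixed are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "variable_1": [4, 5, 6],
    "color": pd.Categorical(["red", "blue", "green"]),
})

# Find every column stored with the category dtype
cat_cols = df.select_dtypes(include="category").columns

# Build a converted copy; the original frame keeps its categorical columns
df_fixed = df.astype({col: object for col in cat_cols})

print(df_fixed.dtypes)
```

The converted copy can then be passed to fit, while the original frame is left unchanged for any code that relies on the category dtype.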

1 reaction
vilhelmp commented, Mar 24, 2021

Nice solution. After implementing it, I got an exception due to the for-loop directly after this in mfa.py.

You referenced this in your issue: https://github.com/MaxHalford/prince/blob/988f7fe01b6e4c9476517d1939f5fe0e13deb158/prince/mfa.py#L45-L52

I implemented your fix (important to keep self.all_nums_[name] = all_num), and got a new exception referencing the for-loop after this, here: https://github.com/MaxHalford/prince/blob/988f7fe01b6e4c9476517d1939f5fe0e13deb158/prince/mfa.py#L54-L75

The problem is that when running FAMD, self.all_nums_ is a dictionary with only 'Numerical' as a key, if I understand the code correctly. However, self.groups, which is created in famd.py, is a dictionary with both 'Numerical' and 'Categorical' as keys. So when the loop checks self.all_nums_[name] with name set to 'Categorical', it raises a KeyError, because 'Numerical' is the only key.

I changed if self.all_nums_[name]: to if name == 'Numerical':

This, however, is only a quick fix and probably not a robust one.
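The key mismatch can be illustrated with plain dictionaries. This is a simplified sketch rather than the code in mfa.py: groups and all_nums stand in for self.groups and self.all_nums_, and treating a missing entry as "not numeric" via dict.get is an assumption made here, not something the library does.

```python
# Stand-ins for self.groups and self.all_nums_ as described in the comment
groups = {
    "Numerical": ["variable_1", "variable_2", "variable_3"],
    "Categorical": ["color"],
}
all_nums = {"Numerical": True}  # no entry was recorded for 'Categorical'

# Direct indexing fails for the group that was never recorded
try:
    all_nums["Categorical"]
    raised = False
except KeyError:
    raised = True

# Assumption: a group without an entry is treated as non-numeric
numeric_flags = {name: all_nums.get(name, False) for name in sorted(groups)}

print(raised, numeric_flags)
```

Compared with testing name == 'Numerical', a lookup with a default does not depend on the group names chosen in famd.py, though recording a flag for every group when the types are checked avoids the missing key altogether.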
