
chi2 should support categorical data other than binary or document count


Describe the bug

When looking for correlation between features (for feature selection), I found that the sklearn implementation of the Chi2 test of independence produces significantly different results from the scipy.stats implementation.

My sample data contains 300 records, with 6 anonymized categorical features and the label. My focus is on feature A. The data is available in this folder on GitHub. The file sample300.csv contains the data, while the file chi2_showcase.ipynb has the code demonstrating the mismatch.

For feature A, sklearn’s SelectKBest() returned the lowest ranking, suggesting there is no correlation between A and the target. But scipy.stats.chi2_contingency() returned a very different result, suggesting the correlation is very high.

Because of the mismatch between the two, I went a long way performing a number of different tests, described in detail in this article. The results suggest that the scipy implementation is correct, while the sklearn implementation is incorrect.

Steps/Code to Reproduce

Please see the two links given above, where I provided the full source code and the results. The piece of code is quite standard:

import pandas as pd
import sklearn.feature_selection as skfs
from sklearn.feature_selection import SelectKBest

# df, cat_feature_cols and label come from loading sample300.csv (see the linked notebook)
fs = SelectKBest(score_func=skfs.chi2, k='all')
X, y = df[cat_feature_cols], df[label]
selector = fs.fit(X, y)
kbest = pd.DataFrame({'feature': X.columns, 'score': fs.scores_})
kbest.sort_values(by='score', ascending=False).reset_index()

Expected Results

I would expect that sklearn.feature_selection.SelectKBest(score_func=skfs.chi2) returns the same, or at least similar, results (p-value and chi2 statistic) as scipy.stats.chi2_contingency(). In the particular case of feature A from my set, these expected results are:

chi2 = 127.497517
p-value = 1.445816e-29
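
For reference, a minimal sketch of how the scipy figures above can be reproduced from the sample data; the column names 'A' and 'label' are placeholders for the actual column names in sample300.csv:

import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of feature A against the target
# ('A' and 'label' stand in for the real column names in sample300.csv)
table = pd.crosstab(df['A'], df['label'])
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(chi2_stat, p_value)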

Actual Results

For feature A in my set, sklearn.feature_selection.chi2() (encapsulated inside SelectKBest(score_func=skfs.chi2)) returned the lowest rank of all features, suggesting no correlation. Feature A has a score of 1.412797, while the other features score between 24 and 1647.

In contrast, scipy.stats.chi2_contingency() gave the highest rank to feature A, suggesting high correlation. The other tests described in the article suggest that the latter is correct.

Versions

System:
    python: 3.7.7 (default, May  6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\pplaszczak\AppData\Local\Continuum\anaconda3\python.exe
   machine: Windows-10-10.0.18362-SP0

Python dependencies:
          pip: 21.0.1
   setuptools: 52.0.0.post20210125
      sklearn: 0.24.1
        numpy: 1.19.2
        scipy: 1.6.2
       Cython: 0.29.22
       pandas: 1.2.3
   matplotlib: 3.3.4
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
solegalli commented on Jun 26, 2022

Hello, I am chipping in a bit late; you guys have already discussed a lot on this issue.

I think many users, like me, would assume that sklearn’s chi2 can be used interchangeably with scipy’s chi_square. It’s a bit dangerous.

Pearson’s chi square, sometimes referred to as Fisher’s score in the feature selection literature (although they are not strictly the same), is commonly used to select categorical features in classification. So, without thinking too much, I expected sklearn’s to do that.

Unless you run scipy’s chi_square side by side with sklearn’s chi2, you wouldn’t notice that they are not examining the same thing. In fact, digging into the source code, the two look almost identical until you get to how the expected and the observed frequencies are calculated.
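
To make that difference concrete, here is a simplified paraphrase (based on my reading of the source, not the exact implementation) of what sklearn.feature_selection.chi2 computes: the raw feature values are summed per class and treated as observed counts, which only makes sense for non-negative frequencies such as document/word counts, not for arbitrary category codes.

import numpy as np
from sklearn.preprocessing import LabelBinarizer

def sklearn_style_chi2(X, y):
    # Simplified sketch of sklearn.feature_selection.chi2 (dense input, no error handling)
    X = np.asarray(X, dtype=float)
    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:                         # binary target: add the complement column
        Y = np.hstack([1 - Y, Y])
    observed = Y.T @ X                          # per-class sums of the raw feature values
    expected = np.outer(Y.mean(axis=0), X.sum(axis=0))
    return ((observed - expected) ** 2 / expected).sum(axis=0)

A contingency-table test such as scipy.stats.chi2_contingency instead works on a table of how often each (category, class) pair occurs, so the two only agree when the feature values are themselves counts (e.g. one-hot or document-term data).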

Bottom line, I think for “distracted” users like myself 2 things would be useful:

  • clearly state in the docs that this method is not the equivalent of Pearson’s chi square for categorical variables in classification.
  • maybe add an implementation of Pearson’s chi square? Or show how scipy’s could be used with SelectKBest, if at all possible? (A rough sketch of the latter is given below.)
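
As a starting point, here is a minimal sketch (my own illustration, not an existing sklearn API) of how scipy’s test could be plugged into SelectKBest via a custom score_func; chi2_contingency_score is a hypothetical name:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.feature_selection import SelectKBest

def chi2_contingency_score(X, y):
    # Hypothetical helper: Pearson's chi-square of each categorical column against y,
    # computed from the observed contingency table
    X = pd.DataFrame(X)
    scores, pvalues = [], []
    for col in X.columns:
        table = pd.crosstab(np.asarray(X[col]), np.asarray(y))
        chi2_stat, p, _, _ = chi2_contingency(table)
        scores.append(chi2_stat)
        pvalues.append(p)
    return np.asarray(scores), np.asarray(pvalues)

# usage: SelectKBest(score_func=chi2_contingency_score, k='all').fit(X, y)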

In @glemaitre’s comment there are 3 bullets: either modify the docs (1), or expand the class to accommodate Pearson’s test (2, 3). Has anything been decided? Is there a PR? Maybe I could give it a go?

0 reactions
altanova commented on Nov 29, 2021

Also, to make this clear: from your code snippet I understood that the bug lies deeper than SelectKBest. The problem concerns sklearn.feature_selection.chi2, and that’s where the fix needs to be applied. The fact that it also shows up in SelectKBest is secondary, as the latter method is a wrapper around the former. This implies that until the fix is applied, the correct workaround is to use scipy. This wasn’t clear to me at first.

