chi2 should support categorical data other than binary or document count
Describe the bug
When looking for correlation between features (for feature selection), I found that sklearn's implementation of the Chi2 test of independence produces significantly different results from the scipy.stats implementation.
My sample data contains 300 records, with 6 anonymized categorical features and the label. My focus is on feature A. The data is available in this folder on GitHub: the file sample300.csv contains the data, while chi2_showcase.ipynb contains the code demonstrating the mismatch.
For feature A, sklearn's SelectKBest() returned the lowest ranking, suggesting there is no correlation between A and the target. But scipy.stats.chi2_contingency() returned a very different result, suggesting the correlation is very strong.
Because of the mismatch between the two, I went on to perform a number of additional tests, described in detail in this article. The results suggest that the scipy implementation is correct, while the sklearn implementation is not.
Steps/Code to Reproduce
Please see the two links given above, where I provided the full source code and the results. The piece of code is quite standard:
import pandas as pd
import sklearn.feature_selection as skfs
from sklearn.feature_selection import SelectKBest
# df, cat_feature_cols and label come from the sample300.csv data linked above
X, y = df[cat_feature_cols], df[label]
fs = SelectKBest(score_func=skfs.chi2, k='all')   # keep all features, just rank them
selector = fs.fit(X, y)
kbest = pd.DataFrame({'feature': X.columns, 'score': fs.scores_})
kbest.sort_values(by='score', ascending=False).reset_index()
Expected Results
I would expect that sklearn.feature_selection.SelectKBest(score_func=skfs.chi2) returns the same, or at least similar, results (p-value and chi2 statistic) as scipy.stats.chi2_contingency(). In the particular case of feature A from my set, these expected results are:
chi2 = 127.497517
p-value = 1.445816e-29
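For reference, a minimal sketch of how the figures above can be reproduced with scipy, assuming the same df as in the snippet above and that the column holding feature A is literally named 'A' (the column name is my assumption):
import pandas as pd
from scipy.stats import chi2_contingency
# Observed contingency table: category levels of feature A vs. target classes
contingency = pd.crosstab(df['A'], df[label])
chi2_stat, p_value, dof, expected = chi2_contingency(contingency)
print(chi2_stat, p_value)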
Actual Results
For feature A in my set, sklearn.feature_selection.chi2() (encapsulated inside SelectKBest(score_func=skfs.chi2)) returned the lowest rank of all features, suggesting no correlation. Feature A has a score of 1.412797, while the other features score between 24 and 1647.
In contrast, scipy.stats.chi2_contingency() gave the highest rank to feature A, suggesting a strong correlation. The other tests described in the article suggest that the latter is correct.
Versions
System:
python: 3.7.7 (default, May 6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\pplaszczak\AppData\Local\Continuum\anaconda3\python.exe
machine: Windows-10-10.0.18362-SP0
Python dependencies:
pip: 21.0.1
setuptools: 52.0.0.post20210125
sklearn: 0.24.1
numpy: 1.19.2
scipy: 1.6.2
Cython: 0.29.22
pandas: 1.2.3
matplotlib: 3.3.4
joblib: 1.0.1
threadpoolctl: 2.1.0
Built with OpenMP: True
Top GitHub Comments
Hello, I am chipping in a bit late; you have already discussed this issue a lot.
I think many users, like me, would assume that sklearn's chi2 can be used interchangeably with scipy's chi-square. That is a bit dangerous.
Pearson's chi-square, sometimes referred to as Fisher's score in the feature selection literature (although they are not strictly the same), is commonly used to select categorical features in classification. So, without thinking too much, I expected sklearn's chi2 to do exactly that.
Unless you run scipy's chi-square side by side with sklearn's chi2, you wouldn't notice that they are not testing the same thing. In fact, digging into the source code, the two look almost identical until you get to how the observed and expected frequencies are calculated (see the rough sketch below).
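To make the difference concrete, here is a rough sketch of the two computations as I read them from the source; the sklearn part is a paraphrase with my own variable names, not the actual implementation:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.preprocessing import LabelBinarizer

def sklearn_style_chi2(X, y):
    # sklearn treats each column of X as a count (e.g. a term frequency):
    # "observed" is the per-class sum of the column values, not a
    # cross-tabulation of category levels against classes.
    X = np.asarray(X, dtype=float)
    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:                     # binary target -> expand to two columns
        Y = np.hstack([1 - Y, Y])
    observed = Y.T @ X                      # shape (n_classes, n_features)
    expected = np.outer(Y.mean(axis=0), X.sum(axis=0))
    return ((observed - expected) ** 2 / expected).sum(axis=0)

def scipy_style_chi2(x, y):
    # scipy's test of independence works on a full contingency table of
    # the feature's category levels against the target classes.
    table = pd.crosstab(pd.Series(x), pd.Series(y))
    return chi2_contingency(table)[0]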
Bottom line: I think for "distracted" users like myself, two things would be useful.
In @glemaitre's comment there are 3 bullets: either modify the docs (1), or expand the class to accommodate Pearson's test (2, 3). Has anything been decided? Is there a PR? Maybe I could give it a go?
Also, to make this clear: from your code snippet I understood that the issue lies deeper than SelectKBest. The problem concerns sklearn.feature_selection.chi2, and that is where any fix would need to be applied. The fact that it also shows up in SelectKBest is secondary, since the latter is a wrapper around the former. This implies that until a fix is applied, the correct workaround is to use scipy (see the sketch below). This wasn't clear to me at first.
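For completeness, one possible workaround until then (a sketch, not an official API; the name chi2_contingency_score is mine) is to plug a scipy-based scorer into SelectKBest, since score_func only needs to return scores and p-values:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.feature_selection import SelectKBest

def chi2_contingency_score(X, y):
    # Run scipy's chi-squared test of independence column by column,
    # building a proper contingency table for each categorical feature.
    X = pd.DataFrame(X)
    scores, pvalues = [], []
    for col in X.columns:
        stat, p, _, _ = chi2_contingency(pd.crosstab(X[col], pd.Series(y)))
        scores.append(stat)
        pvalues.append(p)
    return np.array(scores), np.array(pvalues)

fs = SelectKBest(score_func=chi2_contingency_score, k='all')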