Incorrect calculations of homogeneity, completeness and v-measure
Description
The calculations of homogeneity, completeness and v-measure are currently based on the original paper of Rosenberg & Hirschberg (2007). However, while doing research on fuzzy clustering evaluation techniques, I came across Utt et al. (2014) (http://www.lrec-conf.org/proceedings/lrec2014/pdf/829_Paper.pdf), which explains in a footnote that the original definitions of homogeneity and completeness contain typos. The authors claim this was confirmed by Rosenberg himself via personal communication.
Definitions used:
- homogeneity = 1 - H(C|K) / H(C)
- completeness = 1 - H(K|C) / H(K)
Corrected definitions:
- homogeneity = 1 - H(C|K) / H(C,K)
- completeness = 1 - H(K|C) / H(K,C)
Furthermore, since the implementation now computes these scores via the mutual information, that derivation would no longer be correct under the corrected definitions. The statement in the documentation that v-measure is identical to normalized mutual information with the averaging method set to ‘arithmetic’ would also be false.
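For context, here is a short sketch (not from the issue itself) of why the documented equivalence follows from the current definitions, using I(C;K) = H(C) - H(C|K) = H(K) - H(K|C):

```latex
h = 1 - \frac{H(C \mid K)}{H(C)} = \frac{I(C;K)}{H(C)},
\qquad
c = 1 - \frac{H(K \mid C)}{H(K)} = \frac{I(C;K)}{H(K)}

v = \frac{2hc}{h + c}
  = \frac{2\, I(C;K)}{H(C) + H(K)}
  = \mathrm{NMI}_{\mathrm{arithmetic}}(C, K)
```

With H(C,K) in the denominators instead, h and c are no longer I(C;K) divided by a single entropy, so the last identity breaks.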
Steps/Code to Reproduce
from sklearn.metrics import homogeneity_completeness_v_measure
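For a concrete, runnable call (the labellings below are hypothetical placeholders, chosen only for illustration):

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Hypothetical toy labellings, for illustration only.
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

# The current implementation divides the conditional entropies by the
# single entropies H(C) and H(K), following the 2007 paper as published.
h, c, v = homogeneity_completeness_v_measure(labels_true, labels_pred)
print(h, c, v)
```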
Expected Results
Homogeneity and completeness computed with the joint entropy H(C,K) in the denominators, per the corrected definitions.
Actual Results
Homogeneity and completeness computed with the single entropies H(C) and H(K) in the denominators, matching the published definitions.
Versions
System:
- python: 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)]
- executable: C:\Users\dtuser\AppData\Local\Programs\Python\Python36\python.exe
- machine: Windows-7-6.1.7601-SP1

BLAS:
- macros:
- lib_dirs:
- cblas_libs: cblas

Python deps:
- pip: 18.1
- setuptools: 40.6.3
- sklearn: 0.20.1
- numpy: 1.15.4
- scipy: 1.1.0
- Cython: None
- pandas: 0.23.4
Top GitHub Comments
Well, I did some tests yesterday, and it seems that the joint entropy does not differ much from the ‘single’ entropy, because the conditional entropy is relatively small. This led to a difference in the scores only after the second decimal place. However, this was tested on a set where the score was already high (> 0.98).
I just reproduced the examples from the paper, which resulted in much larger differences.
So far, it looks like they used the single entropy for their examples; at least those calculations give back the same scores. If I use the joint entropy, the results differ by quite a bit.
I used the following code, which is essentially the same as the one in sklearn except for the joint entropy addition:
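A minimal sketch of such a modification (a reconstruction, not the exact snippet from the comment; entropies are computed directly with numpy/scipy rather than via sklearn's internal helpers):

```python
import numpy as np
from scipy.stats import entropy


def label_entropy(labels):
    """H(A): entropy of a single labelling, in nats."""
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts)  # scipy normalises counts to probabilities


def joint_entropy(labels_a, labels_b):
    """H(A, B): entropy of the joint label distribution, in nats."""
    pairs = np.stack([labels_a, labels_b], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    return entropy(counts)


def homogeneity_completeness_v_measure_joint(labels_true, labels_pred):
    """Variant with H(C, K) in the denominators, per Utt et al. (2014).

    sklearn's version divides the conditional entropies by the single
    entropies H(C) and H(K) instead; everything else is unchanged.
    """
    h_joint = joint_entropy(labels_true, labels_pred)    # H(C, K) == H(K, C)
    h_c_given_k = h_joint - label_entropy(labels_pred)   # H(C | K)
    h_k_given_c = h_joint - label_entropy(labels_true)   # H(K | C)

    homogeneity = 1.0 - h_c_given_k / h_joint if h_joint else 1.0
    completeness = 1.0 - h_k_given_c / h_joint if h_joint else 1.0
    if homogeneity + completeness == 0.0:
        v_measure = 0.0
    else:
        v_measure = (2.0 * homogeneity * completeness
                     / (homogeneity + completeness))
    return homogeneity, completeness, v_measure
```

Calling this next to sklearn's homogeneity_completeness_v_measure on the same labels makes the size of the discrepancy directly visible.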