Neighbors with `metric='jaccard'` breaks clustering
Hello!
I found an odd bug where computing `sc.pp.neighbors()` with `metric='jaccard'` results in random cluster assignments coming out of `sc.tl.louvain()`. Running with the `euclidean` distance metric yields appropriate cluster assignments from `sc.tl.louvain()`.
Reproduce (`adata` is a bone marrow data set, with "ground truth" cell type annotations in `adata.obs['cell_type']`):
```python
import scanpy as sc
import matplotlib.pyplot as plt

sc.pl.tsne(adata, color='cell_type', title='Ground truth')  # color t-SNE plot by ground-truth cell annotations
plt.show()
plt.clf()

sc.pp.neighbors(adata, metric='jaccard', random_state=2018)  # compute neighbor graph with jaccard metric
sc.tl.louvain(adata, random_state=2018)  # then use the Louvain algorithm to identify clusters
sc.pl.tsne(adata, color='louvain', title='Louvain + jaccard metric')
plt.show()
plt.clf()

sc.pp.neighbors(adata, metric='euclidean', random_state=2018)  # compute neighbor graph with euclidean distance metric
sc.tl.louvain(adata, random_state=2018)  # rerun cluster identification
sc.pl.tsne(adata, color='louvain', title='Louvain + default metric')
plt.show()
plt.clf()
```
Thanks! Let me know if you need another set of eyes in tracking this one down 😃
Thanks for the clarification, @falexwolf. The documentation is clear; I'm not sure how I missed it. I agree that euclidean distance is well approximated by PCA (as long as populations are sufficiently large). For other metrics that may not be the case (and for Jaccard it fails badly), so maybe a warning would be appropriate in those cases, rather than changing the behavior of `choose_representation`.
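For concreteness, here is one possible shape such a warning could take. This is purely a hypothetical sketch, not scanpy code: the `warn_if_metric_mismatch` helper and the `SET_BASED_METRICS` list are invented for illustration; the idea is just to flag set-based metrics before the PCA fallback kicks in.

```python
import warnings

# Hypothetical (not part of scanpy): metrics that compare nonzero
# patterns and therefore degenerate on a dense PCA representation.
SET_BASED_METRICS = {'jaccard', 'dice', 'russellrao'}

def warn_if_metric_mismatch(metric, using_pca_rep):
    """Warn when a set-based metric is about to run on PCA coordinates."""
    if using_pca_rep and metric in SET_BASED_METRICS:
        warnings.warn(
            f"metric={metric!r} compares nonzero patterns, but neighbors "
            "will be computed on a dense PCA representation where every "
            "entry is nonzero; distances will be degenerate. Pass "
            "use_rep='X' to compute them on .X directly."
        )
```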
Yes, Joshua, thank you… it makes sense that it takes the nonzero overlap; this is what I meant with "boolean gene expression". 🙂 And yes, on PCA this does not make sense at all.
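To make the "nonzero overlap" point concrete, here is a standalone sketch that re-implements Jaccard dissimilarity as a comparison of nonzero patterns, in the spirit of umap-learn's `jaccard` metric (the toy data is invented for illustration). On sparse, count-like expression the distances vary; on dense PCA coordinates every entry is nonzero, so every pair of cells sits at distance 0 and the resulting kNN graph is arbitrary:

```python
import numpy as np

def jaccard_nonzero(u, v):
    """Jaccard dissimilarity on the nonzero patterns of two vectors."""
    a, b = u != 0, v != 0
    union = np.sum(a | b)
    return 0.0 if union == 0 else 1.0 - np.sum(a & b) / union

rng = np.random.default_rng(0)

# Sparse, count-like expression: nonzero patterns differ between cells,
# so the distances are informative.
counts = rng.poisson(0.3, size=(3, 100))
print(jaccard_nonzero(counts[0], counts[1]))      # some value in (0, 1)
print(jaccard_nonzero(counts[0], counts[2]))

# Dense PCA coordinates: every entry is nonzero, so all pairs are at
# distance 0 and the neighbor graph carries no information.
pca_like = rng.normal(size=(3, 50))
print(jaccard_nonzero(pca_like[0], pca_like[1]))  # 0.0
```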
OK, you dug out the private function `tools._utils.choose_representation`. This returns the PCA representation if the data matrix has more than 50 variables; see the documentation of the `use_rep` argument and the code: https://github.com/theislab/scanpy/blob/8e06ff6ecfab892240b58d2206e461685216a926/scanpy/tools/_utils.py#L22-L43. This behavior is intended, as it is rarely advisable to compute distances on an uncompressed data matrix with more than 50 dimensions. Don't you think so? If `.X` is already a 100-dimensional compressed latent representation of another model, then, of course, a PCA on top of that could be nonsense; here I'd agree.
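As a practical takeaway from this thread: the PCA substitution can be bypassed explicitly. A minimal sketch, assuming `adata` from the reproduction above and that `.X` holds the sparse, count-like matrix you actually want nonzero-overlap Jaccard computed on:

```python
import scanpy as sc

# use_rep='X' tells sc.pp.neighbors to evaluate the metric on adata.X
# itself instead of substituting the 50-dimensional PCA representation
# via choose_representation.
sc.pp.neighbors(adata, metric='jaccard', use_rep='X', random_state=2018)
sc.tl.louvain(adata, random_state=2018)
```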