Neighbors with `metric='jaccard'` breaks clustering
Hello!
I found an odd bug where computing `sc.pp.neighbors()` with `metric='jaccard'` results in random cluster assignments coming out of `sc.tl.louvain()`. Running with the `euclidean` distance metric yields appropriate cluster assignments from `sc.tl.louvain()`.
Reproduce (`adata` is a bone marrow data set, with "ground truth" cell type annotations in `adata.obs['cell_type']`):
```python
import scanpy as sc
import matplotlib.pyplot as plt

sc.pl.tsne(adata, color='cell_type', title='Ground truth')  # color t-SNE plot by ground-truth cell annotations
plt.show()
plt.clf()

sc.pp.neighbors(adata, metric='jaccard', random_state=2018)  # compute neighbor graph with jaccard metric
sc.tl.louvain(adata, random_state=2018)  # then use the Louvain algorithm to identify clusters
sc.pl.tsne(adata, color='louvain', title='Louvain + jaccard metric')
plt.show()
plt.clf()

sc.pp.neighbors(adata, metric='euclidean', random_state=2018)  # compute neighbor graph with euclidean distance metric
sc.tl.louvain(adata, random_state=2018)  # rerun cluster identification
sc.pl.tsne(adata, color='louvain', title='Louvain + default metric')
plt.show()
plt.clf()
```
Thanks! Let me know if you need another set of eyes in tracking this one down 😃
Thanks for the clarification, @falexwolf. The documentation is clear; I'm not sure how I missed it. I agree that euclidean distance is well approximated by PCA (as long as populations are sufficiently large). For other metrics that may not be the case (and for Jaccard it fails badly), so maybe a warning would be appropriate in those cases, rather than changing the behavior of `choose_representation`.
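For concreteness, here is one possible shape such a warning could take. This is purely a hypothetical sketch, not scanpy code: the `warn_if_metric_mismatch` helper and the `SET_BASED_METRICS` list are invented for illustration; the idea is just to flag set-based metrics before the PCA fallback kicks in.

```python
import warnings

# Hypothetical (not part of scanpy): metrics that compare nonzero
# patterns and therefore degenerate on a dense PCA representation.
SET_BASED_METRICS = {'jaccard', 'dice', 'russellrao'}

def warn_if_metric_mismatch(metric, using_pca_rep):
    """Warn when a set-based metric is about to run on PCA coordinates."""
    if using_pca_rep and metric in SET_BASED_METRICS:
        warnings.warn(
            f"metric={metric!r} compares nonzero patterns, but neighbors "
            "will be computed on a dense PCA representation where every "
            "entry is nonzero; distances will be degenerate. Pass "
            "use_rep='X' to compute them on .X directly."
        )
```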
Yes, Joshua, thank you… it makes sense that it takes the nonzero overlap; this is what I meant with "boolean gene expression". 🙂 And yes, on PCA this does not make sense at all.
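To make the "nonzero overlap" point concrete, here is a standalone sketch that re-implements Jaccard dissimilarity as a comparison of nonzero patterns, in the spirit of umap-learn's `jaccard` metric (the toy data is invented for illustration). On sparse, count-like expression the distances vary; on dense PCA coordinates every entry is nonzero, so every pair of cells sits at distance 0 and the resulting kNN graph is arbitrary:

```python
import numpy as np

def jaccard_nonzero(u, v):
    """Jaccard dissimilarity on the nonzero patterns of two vectors."""
    a, b = u != 0, v != 0
    union = np.sum(a | b)
    return 0.0 if union == 0 else 1.0 - np.sum(a & b) / union

rng = np.random.default_rng(0)

# Sparse, count-like expression: nonzero patterns differ between cells,
# so the distances are informative.
counts = rng.poisson(0.3, size=(3, 100))
print(jaccard_nonzero(counts[0], counts[1]))      # some value in (0, 1)
print(jaccard_nonzero(counts[0], counts[2]))

# Dense PCA coordinates: every entry is nonzero, so all pairs are at
# distance 0 and the neighbor graph carries no information.
pca_like = rng.normal(size=(3, 50))
print(jaccard_nonzero(pca_like[0], pca_like[1]))  # 0.0
```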
OK, you dug out the private function `tools._utils.choose_representation`. This returns the PCA representation if the data matrix has more than 50 variables; see the documentation of the `use_rep` argument and the code: https://github.com/theislab/scanpy/blob/8e06ff6ecfab892240b58d2206e461685216a926/scanpy/tools/_utils.py#L22-L43. This behavior is intended, as it is rarely advisable to compute distances on an uncompressed data matrix with more than 50 dimensions. Don't you think so? If `.X` is already a 100-dimensional compressed latent representation of another model, then, of course, a PCA on top of that could be nonsense; here I'd agree.
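As a practical takeaway from this thread: the PCA substitution can be bypassed explicitly. A minimal sketch, assuming `adata` from the reproduction above and that `.X` holds the sparse, count-like matrix you actually want nonzero-overlap Jaccard computed on:

```python
import scanpy as sc

# use_rep='X' tells sc.pp.neighbors to evaluate the metric on adata.X
# itself instead of substituting the 50-dimensional PCA representation
# via choose_representation.
sc.pp.neighbors(adata, metric='jaccard', use_rep='X', random_state=2018)
sc.tl.louvain(adata, random_state=2018)
```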