Neighbors with `metric='jaccard'` breaks clustering

Hello!

I found an odd bug where computing sc.pp.neighbors() with metric='jaccard' results in random cluster assignments coming out of sc.tl.louvain(). Running with the euclidean distance metric yields appropriate cluster assignments from sc.tl.louvain().

To reproduce (adata is a bone marrow data set with “ground truth” cell type annotations in adata.obs['cell_type']):

import matplotlib.pyplot as plt
import scanpy as sc

# adata is assumed to be loaded already, with a t-SNE embedding computed
sc.pl.tsne(adata, color='cell_type', title='Ground truth')  # color the t-SNE plot by ground truth cell annotations
plt.show()
plt.clf()

sc.pp.neighbors(adata, metric='jaccard', random_state=2018)  # compute the neighbor graph with the Jaccard metric
sc.tl.louvain(adata, random_state=2018)  # then use the Louvain algorithm to identify clusters
sc.pl.tsne(adata, color='louvain', title='Louvain + jaccard metric')
plt.show()
plt.clf()

sc.pp.neighbors(adata, metric='euclidean', random_state=2018)  # compute the neighbor graph with the Euclidean distance metric
sc.tl.louvain(adata, random_state=2018)  # rerun cluster identification
sc.pl.tsne(adata, color='louvain', title='Louvain + default metric')
plt.show()
plt.clf()

Thanks! Let me know if you need another set of eyes in tracking this one down 😃

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
batson commented, Jun 25, 2018

Thanks for the clarification, @falexwolf. The documentation is clear, not sure how I missed it. I agree that euclidean distance is well-approximated by PCA (as long as populations are sufficiently large). For other metrics, that may not be the case (and for Jaccard it bails hard), and so maybe a warning would be appropriate in those cases rather than changing the behavior of choose_representation.

2 reactions
falexwolf commented, Jun 25, 2018

Yes, Joshua, thank you… it makes sense that it takes the nonzero overlap - this is what I meant with “boolean gene expression”. 🙂 And yes, on PCA this does not make sense at all.
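The point about nonzero overlap can be illustrated with a small sketch. It assumes a Jaccard distance defined on the nonzero pattern of each vector, as described in this comment; the function name and all numbers are made up for illustration and are not taken from scanpy or from the issue's data.

```python
def jaccard_distance(u, v):
    """Jaccard distance on the nonzero support of two equal-length vectors.

    Illustrative helper only: values are reduced to "nonzero or not"
    before the overlap is computed.
    """
    a = [x != 0 for x in u]
    b = [x != 0 for x in v]
    union = sum(x or y for x, y in zip(a, b))
    if union == 0:
        return 0.0  # convention chosen here: two all-zero vectors are identical
    intersection = sum(x and y for x, y in zip(a, b))
    return 1.0 - intersection / union


# Sparse, expression-like vectors (made-up counts): the nonzero patterns differ.
cell_a = [3, 0, 0, 1, 0, 2]
cell_b = [0, 5, 0, 2, 0, 0]

# Dense, PCA-like coordinates (made-up values): every entry is nonzero.
pc_a = [1.7, -0.3, 2.2, 0.9, -1.1, 0.4]
pc_b = [-2.5, 4.1, -0.8, 0.2, 3.3, -1.9]

expr_dist = jaccard_distance(cell_a, cell_b)  # depends on which genes are nonzero
pca_dist = jaccard_distance(pc_a, pc_b)       # supports are identical

print(expr_dist, pca_dist)
```

Under this definition, two very different points in PCA space come out at distance 0 because all of their coordinates are nonzero. If every pair of cells ties like this, the choice of nearest neighbors is essentially arbitrary, which would be consistent with the random-looking clusters reported above.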

OK, you dug out the private function tools._utils.choose_representation. This returns the PCA representation if the data matrix has more than 50 variables: See the documentation of the use_rep argument here and the code https://github.com/theislab/scanpy/blob/8e06ff6ecfab892240b58d2206e461685216a926/scanpy/tools/_utils.py#L22-L43. This behavior is intended as it is rarely advisable to compute distances on an uncompressed data matrix with more than 50 dimensions. Don’t you think so? If .X is already a 100-dimensional compressed latent representation of another model, then, of course, a PCA on top of that could be nonsense - here I’d agree.
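The selection rule described in this comment can be sketched as follows. This is a simplified stand-in, not scanpy's code: the function name `pick_representation`, its parameters, and the returned labels are hypothetical, and only the "more than 50 variables means PCA" rule and the existence of a `use_rep` override come from the comment.

```python
def pick_representation(n_vars, use_rep=None, threshold=50):
    """Return a label for the representation that distances are computed on.

    Hypothetical sketch of the rule described above, not the library function.
    """
    if use_rep is not None:
        # An explicit request always takes precedence over the default rule.
        return use_rep
    # Default rule: fall back to PCA for wide matrices.
    return "X_pca" if n_vars > threshold else "X"


default_choice = pick_representation(n_vars=2000)                # wide matrix, no override
small_matrix = pick_representation(n_vars=30)                    # narrow matrix, no override
explicit_choice = pick_representation(n_vars=2000, use_rep="X")  # override wins

print(default_choice, small_matrix, explicit_choice)
```

In scanpy itself, the documented `use_rep` argument of `sc.pp.neighbors` (for example `use_rep='X'`) is the way to request distances on the data matrix directly. Whether that is advisable for a matrix with thousands of variables is exactly the trade-off discussed in this thread.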
