What is the mathematical logic of running UMAP multiple times?
Hi! I love UMAP and use it extensively for my analysis.
I’m developing new single-cell visualization methods and came up with a particularly interesting Riemannian-manifold learning solution. I then run UMAP on this representation, rather than on the traditional PCA coordinates.
While running UMAP in this setting, I found that the intrinsic dimensionality of the data was too high for an intuitive interpretation, even with 3D UMAP plots. I then tried running UMAP with 100 components and gave these as input to a new UMAP embedding.
# In Seurat v3: embed the 100D 'Beltrami' reduction into 100 UMAP
# components, then feed each embedding back into UMAP
neu <- RunUMAP(neu, reduction = 'Beltrami', dims = 1:100, min.dist = 0.5, reduction.key = 'dbMAP_', n.components = 100)
dual <- neu@reductions$umap@cell.embeddings
rownames(dual) <- colnames(neu)
# Wrap the embedding in a DimReduc object so RunUMAP can consume it
neu[['dual']] <- CreateDimReducObject(embeddings = dual, key = 'dual_', assay = DefaultAssay(neu))
neu <- RunUMAP(neu, reduction = 'dual', dims = 1:100, min.dist = 0.5, reduction.key = 'dbMAP_', n.components = 3)
triple <- neu@reductions$umap@cell.embeddings
rownames(triple) <- colnames(neu)
neu[['triple']] <- CreateDimReducObject(embeddings = triple, key = 'triple_', assay = DefaultAssay(neu))
# 'triple' has only 3 components, so dims = 1:3 here
neu <- RunUMAP(neu, reduction = 'triple', dims = 1:3, min.dist = 0.5, reduction.key = 'dbMAP_', n.components = 3)
I’ve uploaded the resulting plots here. Surprisingly, the initial clusters were more clearly aligned after this. Although the results are reasonably similar, they differ in the resulting branching resolution, which is interesting.
However, as I don’t understand the internal mechanics of UMAP, I’m unsure whether visualizing data through these ‘multi-embedded embeddings’ is reliable at all. What is the mathematical logic of running UMAP multiple times? Is there any logic to it at all?
Any clues from the developers or the UMAP community?
Top GitHub Comments
UMAP doesn’t preserve all distances. That is something MDS tries to do, with the caveat that achieving it usually results in sub-optimal embeddings for many datasets. In particular, this is because in an all-pairs sense there are many more large distances between data points than small ones, so the optimization ultimately places much more emphasis on them and doesn’t do entirely what one might want (unless, of course, preserving all distances as well as possible is exactly what you want).
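For contrast, here is a minimal sketch of that all-pairs objective using base R’s classical MDS (the matrix X is just a random stand-in, not data from this issue):
# Classical MDS fits ALL pairwise distances at once
X <- matrix(rnorm(500 * 20), nrow = 500)  # placeholder data
mds_emb <- cmdscale(dist(X), k = 2)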
UMAP tries to preserve local distances as much as possible, while retaining some of the global structure if it can. Even so, the uniform distribution assumption, along with the local connectivity assumption, creates significant distortions of the ambient distance structure, so what UMAP preserves is not high-dimensional distances but something else again: an approximation of manifold distances under certain assumptions.
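One way to see the local-preservation claim concretely is to check how many of each point’s high-dimensional nearest neighbours survive in the embedding. A rough sketch, assuming the uwot and FNN packages and the placeholder matrix X from above (neither package is mentioned in the original thread):
library(uwot)
library(FNN)
k <- 15
emb <- umap(X, n_neighbors = k, min_dist = 0.5)
nn_high <- get.knn(X, k = k)$nn.index    # neighbours in the original space
nn_low  <- get.knn(emb, k = k)$nn.index  # neighbours in the embedding
# Average fraction of high-dimensional neighbours kept per point
mean(sapply(seq_len(nrow(X)), function(i)
  length(intersect(nn_high[i, ], nn_low[i, ])) / k))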
On top of all that, UMAP pursues a stochastic optimization approach on a non-convex problem and certainly does not achieve the optimum, so the output does not even preserve the locally distorted distances described above. What UMAP is preserving is topological structure, up to a point, and ultimately it will tend to accentuate the structure it does find. That means running UMAP multiple times essentially compounds that accentuation of the topological structures found. From a practical point of view, running UMAP on the output of UMAP will tend to sharpen the connected-component structure (i.e. increase the clustering) and collapse noisy structures down to simpler ones, hence the accentuation of the linear strand-like structures and loops that you see in your example.
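Outside of Seurat, the iterated pipeline from the question can be sketched with the uwot package (an illustration under the same assumptions as above, with X standing in for the Beltrami coordinates, not the poster’s exact setup):
library(uwot)
emb1 <- umap(X, n_components = 100, min_dist = 0.5)   # first pass, 100D
emb2 <- umap(emb1, n_components = 3, min_dist = 0.5)  # re-embed: clusters sharpen
emb3 <- umap(emb2, n_components = 3, min_dist = 0.5)  # third pass, as in 'triple'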
Maybe this section of the FAQ helps you: https://umap-learn.readthedocs.io/en/latest/faq.html#can-i-cluster-the-results-of-umap
Especially this part:
… with its uniform density assumption, does not preserve density well. What UMAP will do, however, is contract connected components of the manifold together.
So you are basically doing that twice. UMAP also uses stochastic gradient descent, so repeating the same experiment multiple times may lead to different results.
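A quick way to see this run-to-run variability, again sketched with uwot and the placeholder X (with uwot’s default single-threaded SGD, fixing the seed should make a run repeatable):
library(uwot)
set.seed(1); run_a <- umap(X, min_dist = 0.5)
set.seed(2); run_b <- umap(X, min_dist = 0.5)        # generally a different layout
set.seed(1); run_a_again <- umap(X, min_dist = 0.5)  # should match run_a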
In general, the readthedocs documentation is very well written and clarifies a lot.