What is the mathematical logic of running UMAP multiple times?
Hi! I love UMAP and use it extensively for my analysis.
I’m developing new single-cell visualization methods and came up with a particularly interesting Riemannian-manifold learning solution. I then run UMAP on this representation, rather than on the traditional PCA coordinates.
While running UMAP in this setting, I found that the intrinsic dimensionality of the data was too high for an intuitive interpretation, even with 3D UMAP plots. I then tried running UMAP with 100 components and gave these as input to a new UMAP embedding.
# In Seurat v3: embed the 100D 'Beltrami' reduction into 100 UMAP
# components, then feed each embedding back into UMAP
neu <- RunUMAP(neu, reduction = 'Beltrami', dims = 1:100, min.dist = 0.5, reduction.key = 'dbMAP_', n.components = 100)
dual <- neu@reductions$umap@cell.embeddings
rownames(dual) <- colnames(neu)
# Wrap the embedding in a DimReduc object so RunUMAP can consume it
neu[['dual']] <- CreateDimReducObject(embeddings = dual, key = 'dual_', assay = DefaultAssay(neu))
neu <- RunUMAP(neu, reduction = 'dual', dims = 1:100, min.dist = 0.5, reduction.key = 'dbMAP_', n.components = 3)
triple <- neu@reductions$umap@cell.embeddings
rownames(triple) <- colnames(neu)
neu[['triple']] <- CreateDimReducObject(embeddings = triple, key = 'triple_', assay = DefaultAssay(neu))
# 'triple' has only 3 components, so dims = 1:3 here
neu <- RunUMAP(neu, reduction = 'triple', dims = 1:3, min.dist = 0.5, reduction.key = 'dbMAP_', n.components = 3)
I’ve uploaded the resulting plots here. Surprisingly, the initial clusters were more clearly aligned after this. Although the results are reasonably similar, they differ in the resulting branching resolution, which is interesting.
However, as I don’t understand the internal mechanics of UMAP, I’m unsure whether visualizing data through these ‘multi-embedded embeddings’ is reliable at all. What is the mathematical logic of running UMAP multiple times? Is there any logic to it at all?
Any clues from the developers or the UMAP community?
Top GitHub Comments
UMAP doesn’t preserve all distances. That is something MDS tries to do, with the caveat that achieving it usually results in sub-optimal embeddings for many datasets. In particular, this is because in an all-pairs sense there are many more large distances between data points than small ones, so the optimization ultimately places much more emphasis on them and doesn’t do entirely what one might want (unless, of course, preserving all distances as well as possible is exactly what you want).
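For contrast, here is a minimal sketch of that all-pairs objective using base R’s classical MDS (the matrix X is just a random stand-in, not data from this issue):
# Classical MDS fits ALL pairwise distances at once
X <- matrix(rnorm(500 * 20), nrow = 500)  # placeholder data
mds_emb <- cmdscale(dist(X), k = 2)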
UMAP tries to preserve local distances as much as possible, while retaining some of the global structure if it can. Even so, the uniform distribution assumption, along with the local connectivity assumption, creates significant distortions of the ambient distance structure, so what UMAP preserves is not high-dimensional distances but something else again: an approximation of manifold distances under certain assumptions.
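One way to see the local-preservation claim concretely is to check how many of each point’s high-dimensional nearest neighbours survive in the embedding. A rough sketch, assuming the uwot and FNN packages and the placeholder matrix X from above (neither package is mentioned in the original thread):
library(uwot)
library(FNN)
k <- 15
emb <- umap(X, n_neighbors = k, min_dist = 0.5)
nn_high <- get.knn(X, k = k)$nn.index    # neighbours in the original space
nn_low  <- get.knn(emb, k = k)$nn.index  # neighbours in the embedding
# Average fraction of high-dimensional neighbours kept per point
mean(sapply(seq_len(nrow(X)), function(i)
  length(intersect(nn_high[i, ], nn_low[i, ])) / k))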
On top of all that, UMAP pursues a stochastic optimization approach on a non-convex problem and certainly does not achieve the optimum, so the output does not even preserve the locally distorted distances described above. What UMAP is preserving is topological structure, up to a point, and ultimately it will tend to accentuate the structure it does find. That means running UMAP multiple times essentially compounds that accentuation of the topological structures found. From a practical point of view, running UMAP on the output of UMAP will tend to sharpen the connected-component structure (i.e. increase the clustering) and collapse noisy structures down to simpler ones, hence the accentuation of the linear strand-like structures and loops that you see in your example.
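Outside of Seurat, the iterated pipeline from the question can be sketched with the uwot package (an illustration under the same assumptions as above, with X standing in for the Beltrami coordinates, not the poster’s exact setup):
library(uwot)
emb1 <- umap(X, n_components = 100, min_dist = 0.5)   # first pass, 100D
emb2 <- umap(emb1, n_components = 3, min_dist = 0.5)  # re-embed: clusters sharpen
emb3 <- umap(emb2, n_components = 3, min_dist = 0.5)  # third pass, as in 'triple'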
Maybe this section of the FAQ helps you: https://umap-learn.readthedocs.io/en/latest/faq.html#can-i-cluster-the-results-of-umap
Especially this part:
… with its uniform density assumption, does not preserve density well. What UMAP will do, however, is contract connected components of the manifold together.
So you are basically doing that twice. UMAP also uses stochastic gradient descent, so repeating the same experiment multiple times may lead to different results.
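A quick way to see this run-to-run variability, again sketched with uwot and the placeholder X (with uwot’s default single-threaded SGD, fixing the seed should make a run repeatable):
library(uwot)
set.seed(1); run_a <- umap(X, min_dist = 0.5)
set.seed(2); run_b <- umap(X, min_dist = 0.5)        # generally a different layout
set.seed(1); run_a_again <- umap(X, min_dist = 0.5)  # should match run_a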
In general, the readthedocs documentation is very well written and clarifies a lot.