[Question] Is it safe to fit-transform once to get multiple embeddings?

Quick question, a bit on the mathsy side.

Suppose I want to test different embeddings of my data, say in 2, 5 and 10 dimensions. Can I fit UMAP only once with n_components=10 and then take the first 2 or 5 components, or would this be complete nonsense, meaning I should run a separate fit for each dimension?

adelejackson commented, May 8, 2019

(Caveat: I’m not Leland McInnes; I’m a mathematician and am comfortable with the maths behind UMAP but am not familiar with the code. Take this with a grain of salt.)

I wouldn’t expect this to work. Unlike in PCA, the coordinate system in the transformed data has no particular meaning – you get the same cost for any isometry of a given representation (in particular, for a rotation). (There’s also no reason the first two components would work any better than the last two.)
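The isometry argument can be checked numerically without running UMAP at all. The sketch below is only an illustration under stated assumptions: `Y` is random data standing in for a 10-dimensional embedding (it is not real UMAP output), and `pairwise_distances` is a helper written for this example. It rotates `Y` by a random orthogonal matrix, confirms that all pairwise distances in the full 10-D layout are unchanged, and then shows that the "first two components" of the two equivalent layouts disagree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a 10-D embedding of 200 points (random data, NOT real UMAP output).
Y = rng.normal(size=(200, 10))

# A random orthogonal matrix from a QR decomposition: an isometry of R^10.
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
Y_rot = Y @ Q


def pairwise_distances(A):
    """Euclidean distance between every pair of rows of A (illustrative helper)."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# The two full 10-D layouts are equivalent: every pairwise distance agrees.
full_match = np.allclose(pairwise_distances(Y), pairwise_distances(Y_rot))

# Their first two columns are not: the 2-D slices have different geometry.
slice_match = np.allclose(
    pairwise_distances(Y[:, :2]), pairwise_distances(Y_rot[:, :2])
)

print(full_match, slice_match)
```

Here `full_match` should come out true and `slice_match` false. If the cost only depends on distances within the embedding, both layouts are equally good 10-D answers, yet slicing them gives different 2-D pictures, so the slice cannot be a well-defined 2-D embedding.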

EDIT: The following paragraph is wrong. For example, say you have a 10-dimensional dataset. You can “transform” this under UMAP and you should get basically the same dataset back (up to an isometry, I think). Taking two components of this “transformation”, in general we certainly do not get the embedding you would get with n_components=2. At least, I would hope not, or you could just restrict to two components instead of running UMAP!

adelejackson commented, May 8, 2019

I don’t think, however, that you get the “same” dataset up to isometry when n_components = n_features; from what I understood from the paper, the UMAP algorithm works with local distances and attempts to build a reconstruction that preserves the simplicial structure, i.e. it should preserve the topology of the dataset, not its metric.

Yep, you’re completely correct; I forgot that we use the Euclidean metric for the low-dimensional representation, not the knn-weighted one.