[Question] Is it safe to fit-transform once to get multiple embeddings?
Quick question, a bit on the mathsy side.
Suppose I want to test different embeddings of my data, say in 2, 5 and 10 dimensions. Is the UMAP algorithm amenable to fitting only once with n_components=10 and then taking the first 2 or 5 components, or would this be complete nonsense, so that I should make separate fittings?
Issue Analytics
- State:
- Created 4 years ago
- Comments:5 (1 by maintainers)
Top GitHub Comments
(Caveat: I’m not Leland McInnes; I’m a mathematician and am comfortable with the maths behind UMAP but am not familiar with the code. Take this with a grain of salt.)
I wouldn’t expect this to work. Unlike in PCA, the coordinate system in the transformed data has no particular meaning – you get the same cost for any isometry of a given representation (in particular, for a rotation). (There’s also no reason the first two components would work any better than the last two.)
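To make the rotation point concrete, here is a small sketch in plain Python. It does not call UMAP at all: the 3-dimensional "embedding" is random stand-in data, and the helper names (rotate_plane, pairwise_distances) are made up for illustration. It relies only on the fact that a cost built from pairwise Euclidean distances in the low-dimensional space cannot tell an embedding from a rotated copy of it, whereas keeping the first two coordinates gives a different picture after the rotation.

```python
import math
import random


def rotate_plane(points, i, j, theta):
    """Rotate every point by angle theta in the plane spanned by axes i and j."""
    c, s = math.cos(theta), math.sin(theta)
    rotated = []
    for p in points:
        q = list(p)
        q[i] = c * p[i] - s * p[j]
        q[j] = s * p[i] + c * p[j]
        rotated.append(q)
    return rotated


def pairwise_distances(points):
    """Euclidean distances between all unordered pairs of points."""
    return [
        math.dist(p, q)
        for k, p in enumerate(points)
        for q in points[k + 1:]
    ]


def max_abs_diff(xs, ys):
    return max(abs(x - y) for x, y in zip(xs, ys))


# Stand-in for a fitted 3-component embedding (random data, NOT UMAP output).
rng = random.Random(0)
embedding = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(20)]

# An equally valid embedding: the same points, rotated so axis 0 mixes with axis 2.
rotated = rotate_plane(embedding, 0, 2, 1.0)

# All pairwise distances agree, so a distance-based cost is identical for both.
max_full_diff = max_abs_diff(
    pairwise_distances(embedding), pairwise_distances(rotated)
)

# Keeping only the first two coordinates gives two different point clouds.
max_trunc_diff = max_abs_diff(
    pairwise_distances([p[:2] for p in embedding]),
    pairwise_distances([p[:2] for p in rotated]),
)

print("full embedding, max distance change:", max_full_diff)
print("first two components, max distance change:", max_trunc_diff)
```

The first number should be zero up to floating-point error and the second should not be, which is the sense in which "the first two components" of such an embedding are arbitrary.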
EDIT: The following paragraph is wrong.
For example, say you have a 10 dimensional dataset. You can “transform” this under UMAP and you should get basically the same dataset back (up to an isometry, I think). Taking two components of this “transformation”, in general we certainly do not get the embedding you would get with n_components=2 – at least, I would hope not, or you could just restrict to two components instead of running UMAP!
Yep, you’re completely correct; I forgot that we use the Euclidean metric for the low-dimensional representation, not the knn-weighted one.