
transform function changes the embedding_

See original GitHub issue

Hi, I have noticed something strange, and I am not sure if it’s the expected behavior of UMAP, so I am posting this. When I use the .transform method on a new dataset that has the same shape as the data I used to fit UMAP, the embedding_ values change. But the embedding_ values don’t change if I transform new data with a different shape.


import umap
import numpy as np

data = np.random.rand(100, 50)
fitter = umap.UMAP().fit(data)

# Embedding of the training data right after fitting.
print(fitter.embedding_[:5])
print()

transform_different = np.random.rand(200, 50)  # different number of rows than the training data
transform_same = np.random.rand(100, 50)       # same shape as the training data

fitter.transform(transform_different)
print(fitter.embedding_[:5])  # unchanged
print()

fitter.transform(transform_same)
print(fitter.embedding_[:5])  # changed

The output is

[[ 3.01767   -6.116551 ]
 [ 5.664277  -4.0695806]
 [ 4.170628  -5.1638904]
 [ 6.079433  -6.3256063]
 [ 5.794976  -4.5939784]]

[[ 3.01767   -6.116551 ]
 [ 5.664277  -4.0695806]
 [ 4.170628  -5.1638904]
 [ 6.079433  -6.3256063]
 [ 5.794976  -4.5939784]]

[[ 2.6579268e+00 -7.8308420e+00]
 [ 1.5021657e+00 -6.3121238e+00]
 [ 3.2839913e+00 -8.0168781e+00]
 [-3.5597345e-01 -6.4087682e+00]
 [ 6.9772257e-03 -8.1144466e+00]]

As you can see, the first two outputs contain the same embedding values, but when I transform using a dataset of shape (100, 50), the same shape as the initial one, the embedding values change. Isn’t the embedding supposed to stay the same? If the embedding changes, won’t the ability to transform new data be degraded? I hope that makes sense. Thanks in advance for your time!

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
jlmelville commented, Apr 6, 2020

@DanSchnell I don’t know if there is a description of the transform process anywhere, but it’s similar to how the embedding of the initial dataset is created:

  1. For each point in the new data, find the k nearest neighbors in the original data.
  2. Calculate the smooth knn distances between the test point and its training-set neighbors, which yields the sigma and rho values.
  3. Compute the membership strengths using sigma and rho.
  4. Initialize the output layout coordinates.
  5. Optimize the layout so the output distances resemble the membership strengths.

The only real differences in implementation are in step 1, where neighbors are found between the new data and the original data, i.e. points in the new data cannot be considered neighbors of each other; and in step 4, where the initial layout for the new data is based on the average of the output coordinates of the neighbors in the original data (as found in step 1). A rough sketch of these steps is given below.
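
To make the steps above more concrete, here is a simplified, hypothetical sketch in NumPy/scikit-learn. It is not UMAP’s actual implementation: the smooth-kNN/membership-strength calculation is replaced by a crude exponential weighting, the optimization in step 5 is omitted entirely, and the function name sketch_transform is invented for illustration.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def sketch_transform(train_X, train_embedding, new_X, n_neighbors=15):
    # Step 1: for each new point, find its k nearest neighbors in the ORIGINAL
    # data only; new points are never treated as neighbors of each other.
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(train_X)
    dists, idx = nn.kneighbors(new_X)

    # Steps 2-3 (simplified stand-in): convert distances to membership
    # strengths. Real UMAP solves for per-point sigma and rho via its
    # smooth-kNN-distance routine; an exponential decay is used here instead.
    rho = dists[:, :1]                                  # distance to nearest neighbor
    sigma = dists.mean(axis=1, keepdims=True) + 1e-8
    strengths = np.exp(-np.maximum(dists - rho, 0.0) / sigma)
    strengths /= strengths.sum(axis=1, keepdims=True)

    # Step 4: initialize each new point at the weighted average of the
    # embedding coordinates of its training-set neighbors.
    init = np.einsum('ij,ijk->ik', strengths, train_embedding[idx])

    # Step 5 (omitted here): UMAP would then refine this layout with SGD so
    # that low-dimensional distances reflect the membership strengths.
    return init

Note that in this sketch the training embedding is only read, never written; the report above is that, in the affected versions, transform can end up overwriting embedding_ when the row counts match.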

1 reaction
lmcinnes commented, Mar 31, 2020

Right now the problem occurs if you transform exactly the same number of points as the initial training data. The transform is really meant for a small set of new samples, so you shouldn’t be using it on the same amount of data as the original training set. As long as the counts don’t line up, the embedding should remain unchanged (a simple defensive check is sketched below).
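
Until then, one small defensive option (a sketch continuing from the example above, not an official workaround) is to snapshot embedding_ before transforming and compare it afterwards:

before = fitter.embedding_.copy()
new_coords = fitter.transform(transform_same)

# If the row counts line up and the bug is triggered, this comparison fails;
# the saved copy can then be used to restore or re-check the original layout.
if not np.array_equal(before, fitter.embedding_):
    print("embedding_ was modified by transform()")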

Read more comments on GitHub >

Top Results From Across the Web

  • Transform and Unsupervised Data #55 - lmcinnes/umap
    Right now UMAP is transductive -- it creates a single transform of all the data at once and you would need to redo...
  • 4.4 Embedding Transformations in a Model - Oracle Help Center
    These two functions can be used to create a new transformation list from the transformations embedded in an existing model.
  • Neural Network Embeddings Explained - Towards Data Science
    An embedding is a mapping of a discrete — categorical — variable to a vector of continuous numbers. In the context of neural...
  • Embeddings in Machine Learning: Everything You Need to ...
    It works by transforming the user's text and an image into an embedding in the same latent space.
  • Transforming New Data with UMAP - Read the Docs
    Since we embedded to two dimensions we can visualise the results to ensure that we are getting a potential benefit out of this...
