transform function changes the embedding_

Hi, I have noticed something odd, and I am not sure whether it is the expected behavior of UMAP, so I am posting it here. When I call the .transform method on a new dataset that has the same shape as the data I originally used to fit UMAP, the embedding_ values change. But the embedding_ values do not change if I transform new data with a different shape.


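import umap
import numpy as np

# Fit UMAP on 100 samples with 50 features each.
data = np.random.rand(100, 50)
fitter = umap.UMAP().fit(data)

print(fitter.embedding_[:5])
print()
transform_different = np.random.rand(200, 50)  # different number of rows
transform_same = np.random.rand(100, 50)  # same number of rows as the training data

# Transforming 200 rows: embedding_ is unchanged afterwards.
fitter.transform(transform_different)
print(fitter.embedding_[:5])
print()
# Transforming 100 rows: embedding_ is different afterwards.
fitter.transform(transform_same)
print(fitter.embedding_[:5])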

The output is

[[ 3.01767   -6.116551 ]
 [ 5.664277  -4.0695806]
 [ 4.170628  -5.1638904]
 [ 6.079433  -6.3256063]
 [ 5.794976  -4.5939784]]

[[ 3.01767   -6.116551 ]
 [ 5.664277  -4.0695806]
 [ 4.170628  -5.1638904]
 [ 6.079433  -6.3256063]
 [ 5.794976  -4.5939784]]

[[ 2.6579268e+00 -7.8308420e+00]
 [ 1.5021657e+00 -6.3121238e+00]
 [ 3.2839913e+00 -8.0168781e+00]
 [-3.5597345e-01 -6.4087682e+00]
 [ 6.9772257e-03 -8.1144466e+00]]

As you can see, the first two outputs contain the same embedding values, but when I transform a dataset of shape (100, 50), the same shape as the initial one, the embedding values change. Isn't the embedding supposed to stay fixed? If the embedding changes, won't the ability to transform new data deteriorate? Thanks in advance for your time!

Top GitHub Comments

jlmelville commented, Apr 6, 2020

@DanSchnell I don’t know if there is a description of the transform process anywhere, but it’s similar to how the initial dataset is created:

1. For each point in the new data, find the k nearest neighbors in the original data.
2. Calculate the smooth knn distances between the test point and its training set neighbors, which returns the sigma and rho values.
3. Compute the membership strengths using sigma and rho.
4. Initialize the output layout coordinates.
5. Optimize the layout so the output distances resemble the membership strengths.

The only real difference in implementation specifics is in step 1, where neighbors are found between the new data and the original data, i.e. points in the new data cannot be considered neighbors of each other; and step 4, where the initial layout for the new data is based on the average of the output coordinates of the neighbors in the original data (as found in step 1).
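To make steps 1 and 4 concrete, here is a minimal NumPy sketch of the initialization described above. It is not UMAP's actual code: the function name `init_new_points`, the brute-force neighbor search, the unweighted mean, and the random stand-in for `fitter.embedding_` are all assumptions made for illustration.

```python
import numpy as np

def init_new_points(train_data, train_embedding, new_data, k=5):
    """Hypothetical helper: start each new point at the mean embedding
    of its k nearest neighbors in the training data."""
    # Squared Euclidean distances, shape (n_new, n_train).
    diffs = new_data[:, None, :] - train_data[None, :, :]
    sq_dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    # Step 1: neighbors come only from the training data, never from new_data.
    neighbor_idx = np.argsort(sq_dists, axis=1)[:, :k]
    # Step 4: average the neighbors' output coordinates.
    return train_embedding[neighbor_idx].mean(axis=1)

rng = np.random.default_rng(0)
train_data = rng.random((100, 50))
train_embedding = rng.random((100, 2))  # stand-in for fitter.embedding_
new_data = rng.random((20, 50))

init = init_new_points(train_data, train_embedding, new_data)
print(init.shape)
```

Note that the training embedding is only read here, never written, which is the behavior the original poster expected from transform.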

lmcinnes commented, Mar 31, 2020

Right now the problem occurs if you transform exactly the same number of points as the initial training data. Ideally the transform is meant for a small set of new samples, so you shouldn’t be using it for the same amount of data as the original training set. As long as you don’t have the numbers line up the embedding should remain unchanged.
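If you do need to transform a batch that happens to have as many rows as the training set, one possible workaround is to split it so that no single call lines up with the training size. This is only a sketch based on the comment above, not an official fix: `transform_in_chunks` is a hypothetical helper, and it assumes only that the model exposes `embedding_` and `transform`, as a fitted `umap.UMAP` does.

```python
import numpy as np

def transform_in_chunks(model, new_data):
    """Hypothetical helper: avoid calling model.transform on a batch whose
    row count equals the number of training points."""
    n_train = model.embedding_.shape[0]
    # Sizes differ (or the batch cannot be split): a single call is fine.
    if new_data.shape[0] != n_train or n_train < 2:
        return model.transform(new_data)
    # Sizes line up: transform two smaller halves and stack the results.
    half = n_train // 2
    first = model.transform(new_data[:half])
    second = model.transform(new_data[half:])
    return np.vstack([first, second])
```

With the example from the question this would be called as `transform_in_chunks(fitter, transform_same)`. Since the layout optimization is stochastic, the result of two half-size calls may not match what a single call would have produced.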