transform function changes the embedding_
Hi, I have noticed something odd, and I am not sure whether it is the expected behavior of UMAP, so I am posting it here. When I call the .transform method on a new dataset that has the same shape as the data I used to fit UMAP, the embedding_ values change. The embedding_ values do not change if I transform new data with a different shape.
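import umap
import numpy as np

# Fit UMAP on 100 samples with 50 features each
data = np.random.rand(100, 50)
fitter = umap.UMAP().fit(data)
print(fitter.embedding_[:5])
print()

# New data with a different number of rows than the training data
transform_different = np.random.rand(200, 50)
# New data with the same number of rows as the training data
transform_same = np.random.rand(100, 50)

fitter.transform(transform_different)
print(fitter.embedding_[:5])
print()

fitter.transform(transform_same)
print(fitter.embedding_[:5])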
The output is
[[ 3.01767 -6.116551 ]
[ 5.664277 -4.0695806]
[ 4.170628 -5.1638904]
[ 6.079433 -6.3256063]
[ 5.794976 -4.5939784]]

[[ 3.01767 -6.116551 ]
[ 5.664277 -4.0695806]
[ 4.170628 -5.1638904]
[ 6.079433 -6.3256063]
[ 5.794976 -4.5939784]]

[[ 2.6579268e+00 -7.8308420e+00]
[ 1.5021657e+00 -6.3121238e+00]
[ 3.2839913e+00 -8.0168781e+00]
[-3.5597345e-01 -6.4087682e+00]
[ 6.9772257e-03 -8.1144466e+00]]
As you can see, the first two outputs contain the same embedding values, but when I transform a dataset of shape (100, 50), the same shape as the initial one, the embedding values change. Isn’t the embedding supposed to stay fixed? If the embedding changes, won’t the ability to transform new data deteriorate? I hope this makes sense. Thanks in advance for your time!
Top GitHub Comments
@DanSchnell I don’t know if there is a description of the transform process anywhere, but it’s similar to how the initial dataset is created:
The only real difference in implementation specifics is in step 1, where neighbors are found between the new data and the original data, i.e. points in the new data cannot be considered neighbors of each other; and step 4, where the initial layout for the new data is based on the average of the output coordinates of the neighbors in the original data (as found in step 1).
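The two differences described above can be illustrated with a small NumPy sketch. This is not UMAP's actual code: the function name `init_new_points`, its parameters, and the brute-force distance computation are assumptions made for illustration, and the real implementation may find and weight neighbors differently. It only shows the idea that neighbors are looked up in the original data and that each new point starts at the mean of its neighbors' output coordinates.

```python
import numpy as np

def init_new_points(new_data, train_data, train_embedding, n_neighbors=15):
    """Hypothetical helper: initial layout for new points from training neighbors."""
    # Squared Euclidean distance from every new point to every training point.
    diffs = new_data[:, None, :] - train_data[None, :, :]
    dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    # Step 1: neighbors are searched only among the original (training) points,
    # so new points are never neighbors of each other.
    k = min(n_neighbors, train_data.shape[0])
    nn_idx = np.argpartition(dists, k - 1, axis=1)[:, :k]
    # Step 4: start each new point at the average of its neighbors' coordinates.
    return train_embedding[nn_idx].mean(axis=1)

rng = np.random.default_rng(0)
train = rng.random((100, 50))
embedding = rng.random((100, 2))   # stand-in for a fitted embedding_
new = rng.random((10, 50))
print(init_new_points(new, train, embedding).shape)
```

Because each new point depends only on the training data, the result for one new point does not depend on which other new points are passed in the same call.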
Right now the problem occurs only if you transform exactly the same number of points as were in the initial training data. The transform is meant for a small set of new samples, so ideally you shouldn’t be using it for the same amount of data as the original training set. As long as the number of new points does not match the number of training points, the embedding should remain unchanged.
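One possible workaround, based only on the observation above that the problem appears when the row counts match, is to split an equal-sized batch into two smaller calls. The helper below is a hypothetical sketch, not part of umap; `reducer` is assumed to be any fitted object exposing `embedding_` and `transform`, such as the `fitter` from the question. Since the optimization is stochastic, the chunked result is not guaranteed to be identical to what a single call would return.

```python
import numpy as np

def transform_avoiding_equal_size(reducer, new_data):
    """Hypothetical helper: never pass transform() as many rows as the training set."""
    new_data = np.asarray(new_data)
    n_train = reducer.embedding_.shape[0]
    # Row counts differ (or the set is too small to split): transform directly.
    if new_data.shape[0] != n_train or n_train < 2:
        return reducer.transform(new_data)
    # Row counts match: transform two halves separately so neither call lines up.
    half = n_train // 2
    first = reducer.transform(new_data[:half])
    second = reducer.transform(new_data[half:])
    return np.vstack([first, second])
```

With the example from the question, this would be called as `transform_avoiding_equal_size(fitter, transform_same)`.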