
Are umap transformations non-deterministic?

See original GitHub issue

I am trying to use umap to preprocess some data, and I’ve noticed that the same vector gives a different result depending on the number of rows being passed to the transformation.

That is, the same row vector A produces different output vectors depending on the shape (number of rows) of the data being transformed.

import umap

# fit umap to data X
reducer = umap.UMAP().fit(X)
# transform X using the fitted reducer
embedding = reducer.transform(X)
# transform a subset of X with the same reducer
embedding_sub = reducer.transform(X[:100, :])
# => I was assuming embedding_sub == embedding[:100, :]
# => but that wasn't the case
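The mismatch can be reproduced in miniature with a toy stochastic transform that, like umap’s `transform`, injects fresh randomness on each call. This is purely an illustrative sketch, not umap’s implementation; `toy_transform` and its noise model are invented for the example:

```python
import random

def toy_transform(rows, seed=None):
    # Stand-in for a stochastic transform: each call draws fresh noise,
    # so identical inputs need not map to identical outputs.
    rng = random.Random(seed)
    return [x + rng.uniform(-0.01, 0.01) for x in rows]

X = [float(i) for i in range(200)]

# Unseeded calls: the shared first 100 rows get different noise draws.
full = toy_transform(X)
sub = toy_transform(X[:100])
print(full[:100] == sub)  # False (with overwhelming probability)

# With a fixed per-call seed (loosely analogous to umap's transform_seed),
# this toy draws noise row by row, so the first 100 draws coincide and the
# prefix matches. umap's randomness also enters through batch-wide steps
# such as approximate knn search, which is presumably why seeding alone
# did not restore subset equality in the issue above.
full_seeded = toy_transform(X, seed=42)
sub_seeded = toy_transform(X[:100], seed=42)
print(full_seeded[:100] == sub_seeded)  # True in this toy
```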

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 17 (5 by maintainers)

Top GitHub Comments

8 reactions
dataist commented, Mar 12, 2019

Hey all, I thought I’d add some analysis to this discussion via a Colab notebook here.

There I do a quick quantification of the randomness introduced and explore its impact on some downstream tasks. The conclusions are below (also in the notebook):

We see that cluster assignment can be highly unstable as a result of the noise introduced by umap’s transform, even when the input is the exact data used to train the embedding. If your downstream task resembles what we illustrate above, it may be heavily impacted by the stochastic nature of the umap transform.

I’m curious whether there are any parameters we can pass to the underlying knn search to help reduce the scale of the stochastic noise it introduces (even at computational expense)?

Separately, I would argue that the aggressive memoization used by the current implementation’s transform function can obfuscate downstream problems. If its stochastic nature were evident in all scenarios, it would be more obvious and would help highlight when it is the source of problems like those we illustrate here.
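
The memoization argument can be sketched with a toy reducer whose transform returns cached embeddings for rows it saw at fit time. This is a hypothetical illustration of the behaviour described above, not umap’s actual code; `CachingReducer` and its noise model are invented for the example:

```python
import random

class CachingReducer:
    """Toy reducer whose transform memoizes results for rows seen at
    fit time -- a hypothetical sketch, not umap's implementation."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._cache = {}

    def fit(self, rows):
        # Memoize an embedding for every training row.
        for x in rows:
            self._cache[x] = x + self._rng.uniform(-0.01, 0.01)
        return self

    def transform(self, rows):
        out = []
        for x in rows:
            if x in self._cache:
                # Cached training rows come back bit-identical every call,
                # masking the stochasticity.
                out.append(self._cache[x])
            else:
                # Unseen rows get fresh noise on every call.
                out.append(x + self._rng.uniform(-0.01, 0.01))
        return out

reducer = CachingReducer().fit([1.0, 2.0, 3.0])
a = reducer.transform([1.0, 2.0, 3.0])
b = reducer.transform([1.0, 2.0, 3.0])
print(a == b)  # True: the cache hides the randomness for training rows

c = reducer.transform([4.0])
d = reducer.transform([4.0])
print(c == d)  # False (with overwhelming probability) for an unseen row
```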

I also did some additional parameter exploration on how each parameter may help mitigate the scale of the issue (not illustrated here):

        n_neighbors: no impact
        n_components: no impact
        metric: no impact
        n_epochs: some (30%) improvement as we go over 1000
        learning_rate: some (20%) improvement as we get around 0.5
        init: no impact 
        min_dist: some impact in concert with spread 
        spread: some positive impact by setting this to np.std of X_train 
        set_op_mix_ratio: seems best to keep at 1
        local_connectivity: no impact
        transform_queue_size: no impact
        random_state: no impact
        transform_seed: no impact
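
A sweep like the one above needs a quantitative notion of "impact". One minimal sketch of such an instability score, with toy transforms standing in for fitted umap reducers (the thread does not show the notebook’s exact metric, so the fraction-of-rows-changed measure here is an assumption):

```python
import random

def instability(transform, rows, repeats=5):
    """Fraction of (row, repeat) pairs whose embedding differs from the
    first call -- a simple instability score for a transform."""
    baseline = transform(rows)
    changed = 0
    total = 0
    for _ in range(repeats):
        out = transform(rows)
        for a, b in zip(baseline, out):
            total += 1
            changed += (a != b)
    return changed / total

rows = [float(i) for i in range(50)]

# Toy stand-ins: a deterministic transform and a noisy one.
deterministic = lambda rs: [2.0 * x for x in rs]
noisy = lambda rs: [x + random.uniform(-0.01, 0.01) for x in rs]

print(instability(deterministic, rows))  # 0.0
print(instability(noisy, rows))          # ~1.0: nearly every row moves
```

Scoring each parameter setting by a metric like this, lower being better, is one way to make a sweep such as the one above comparable across parameters.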

1 reaction
apbard commented, Jun 10, 2019

I would like to add to @dataist’s experiments that we also get different results for the same point if:

  • the data are shuffled
  • there are duplicate points (in this case, it seems that only the first occurrence of each duplicate is assigned a different embedding)
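
These two checks generalize into a small harness one could run against any transform. The helper names are hypothetical, and the sketch assumes the transform takes and returns lists of row tuples:

```python
import random

def check_shuffle_invariance(transform, rows):
    """True if each row embeds the same regardless of the order in
    which the batch is presented."""
    base = dict(zip(map(tuple, rows), transform(rows)))
    shuffled = rows[:]
    random.shuffle(shuffled)
    after = dict(zip(map(tuple, shuffled), transform(shuffled)))
    return all(base[k] == after[k] for k in base)

def check_duplicate_stability(transform, row, copies=3):
    """True if duplicate copies of the same row all receive the same
    embedding within one transform call."""
    out = transform([row] * copies)
    return all(v == out[0] for v in out)

# A deterministic identity "reducer" passes both checks ...
ident = lambda rows: [tuple(r) for r in rows]
print(check_shuffle_invariance(ident, [(1,), (2,), (3,)]))  # True
print(check_duplicate_stability(ident, (1,)))               # True

# ... while a noisy one fails them, matching the observations above.
noisy = lambda rows: [(r[0] + random.random(),) for r in rows]
print(check_shuffle_invariance(noisy, [(1,), (2,), (3,)]))  # False
print(check_duplicate_stability(noisy, (1,)))               # False
```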

Top Results From Across the Web

UMAP Reproducibility — umap 0.5 documentation
UMAP is a stochastic algorithm – it makes use of randomness both to speed up approximation steps, and to aid in solving hard...

tSNE vs. UMAP: Global Structure - Towards Data Science
Being initialized with PCA or Graph Laplacian, tSNE becomes a deterministic method. In contrast, UMAP keeps its stochasticity even being ...

Intuitive explanation of how UMAP works, compared to t-SNE
With UMAP, you should be able to interpret both the distances between / positions of points and clusters. Both algorithms are highly stochastic...

UMAP Based Anomaly Detection for Minimal Residual ... - NCBI
Keywords: acute myeloid leukemia, anomaly detection, UMAP, set-transformer ... and UMAP as well as HDBSCAN are non-deterministic algorithms, ...

Understanding UMAP
It's also notable that t-SNE projections vary widely from run to run, with different pieces of the higher-dimensional data projected to different locations....
