
Are umap transformations non-deterministic?

See original GitHub issue

I am trying to use umap to preprocess some data, and I’ve noticed that the same vector gives a different result depending on the number of rows being passed to the transformation.

That is, the same row vector A produces different output vectors depending on the shape (number of rows) of the data being transformed.

import umap

# fit umap to data X
reducer = umap.UMAP().fit(X)
# transform X using the fitted reducer
embedding = reducer.transform(X)
# transform a subset of X with the same reducer
embedding_sub = reducer.transform(X[:100, :])
# => I was assuming embedding_sub == embedding[:100, :]
# => but that wasn't the case
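The mismatch can be reproduced in miniature with a toy stochastic transform that, like umap’s `transform`, injects fresh randomness on each call. This is purely an illustrative sketch, not umap’s implementation; `toy_transform` and its noise model are invented for the example:

```python
import random

def toy_transform(rows, seed=None):
    # Stand-in for a stochastic transform: each call draws fresh noise,
    # so identical inputs need not map to identical outputs.
    rng = random.Random(seed)
    return [x + rng.uniform(-0.01, 0.01) for x in rows]

X = [float(i) for i in range(200)]

# Unseeded calls: the shared first 100 rows get different noise draws.
full = toy_transform(X)
sub = toy_transform(X[:100])
print(full[:100] == sub)  # False (with overwhelming probability)

# With a fixed per-call seed (loosely analogous to umap's transform_seed),
# this toy draws noise row by row, so the first 100 draws coincide and the
# prefix matches. umap's randomness also enters through batch-wide steps
# such as approximate knn search, which is presumably why seeding alone
# did not restore subset equality in the issue above.
full_seeded = toy_transform(X, seed=42)
sub_seeded = toy_transform(X[:100], seed=42)
print(full_seeded[:100] == sub_seeded)  # True in this toy
```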

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 17 (5 by maintainers)

Top GitHub Comments

8 reactions
dataist commented, Mar 12, 2019

Hey all, I thought I’d add some analysis to this discussion via a Colab notebook here.

There I do a quick quantification of the randomness introduced and explore its impact on some downstream tasks. The conclusions are below (also in the notebook):

We see that cluster assignment can be highly unstable as a result of the noise introduced by umap’s transform, even when the input is the exact data used to train the embedding. If your downstream task resembles what we illustrate above, it may be heavily impacted by the stochastic nature of the umap transform.

I’m curious whether there are any parameters we can pass to the underlying knn search to help reduce the scale of the stochastic noise it introduces (even at computational expense)?

Separately, I would argue that the aggressive memoization used by the current implementation’s transform function can obfuscate downstream problems. If its stochastic nature were evident in all scenarios, it would be more obvious and would help highlight when it is the source of problems like those we illustrate here.
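
The memoization argument can be sketched with a toy reducer whose transform returns cached embeddings for rows it saw at fit time. This is a hypothetical illustration of the behaviour described above, not umap’s actual code; `CachingReducer` and its noise model are invented for the example:

```python
import random

class CachingReducer:
    """Toy reducer whose transform memoizes results for rows seen at
    fit time -- a hypothetical sketch, not umap's implementation."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._cache = {}

    def fit(self, rows):
        # Memoize an embedding for every training row.
        for x in rows:
            self._cache[x] = x + self._rng.uniform(-0.01, 0.01)
        return self

    def transform(self, rows):
        out = []
        for x in rows:
            if x in self._cache:
                # Cached training rows come back bit-identical every call,
                # masking the stochasticity.
                out.append(self._cache[x])
            else:
                # Unseen rows get fresh noise on every call.
                out.append(x + self._rng.uniform(-0.01, 0.01))
        return out

reducer = CachingReducer().fit([1.0, 2.0, 3.0])
a = reducer.transform([1.0, 2.0, 3.0])
b = reducer.transform([1.0, 2.0, 3.0])
print(a == b)  # True: the cache hides the randomness for training rows

c = reducer.transform([4.0])
d = reducer.transform([4.0])
print(c == d)  # False (with overwhelming probability) for an unseen row
```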

I also did some additional parameter exploration on how each parameter may help mitigate the scale of the issue (not illustrated here):

        n_neighbors: no impact
        n_components: no impact
        metric: no impact
        n_epochs: some (30%) improvement as we go over 1000
        learning_rate: some (20%) improvement as we get around 0.5
        init: no impact 
        min_dist: some impact in concert with spread 
        spread: some positive impact by setting this to np.std of X_train 
        set_op_mix_ratio: seems best to keep at 1
        local_connectivity: no impact
        transform_queue_size: no impact
        random_state: no impact
        transform_seed: no impact
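
A sweep like the one above needs a quantitative notion of "impact". One minimal sketch of such an instability score, with toy transforms standing in for fitted umap reducers (the thread does not show the notebook’s exact metric, so the fraction-of-rows-changed measure here is an assumption):

```python
import random

def instability(transform, rows, repeats=5):
    """Fraction of (row, repeat) pairs whose embedding differs from the
    first call -- a simple instability score for a transform."""
    baseline = transform(rows)
    changed = 0
    total = 0
    for _ in range(repeats):
        out = transform(rows)
        for a, b in zip(baseline, out):
            total += 1
            changed += (a != b)
    return changed / total

rows = [float(i) for i in range(50)]

# Toy stand-ins: a deterministic transform and a noisy one.
deterministic = lambda rs: [2.0 * x for x in rs]
noisy = lambda rs: [x + random.uniform(-0.01, 0.01) for x in rs]

print(instability(deterministic, rows))  # 0.0
print(instability(noisy, rows))          # ~1.0: nearly every row moves
```

Scoring each parameter setting by a metric like this, lower being better, is one way to make a sweep such as the one above comparable across parameters.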

1 reaction
apbard commented, Jun 10, 2019

I would like to add to @dataist’s experiments that we also get different results for the same point if:

  • the data are shuffled
  • there are duplicate points (in this case, it seems that only the first occurrence of each duplicate is assigned a different embedding)
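
These two checks generalize into a small harness one could run against any transform. The helper names are hypothetical, and the sketch assumes the transform takes and returns lists of row tuples:

```python
import random

def check_shuffle_invariance(transform, rows):
    """True if each row embeds the same regardless of the order in
    which the batch is presented."""
    base = dict(zip(map(tuple, rows), transform(rows)))
    shuffled = rows[:]
    random.shuffle(shuffled)
    after = dict(zip(map(tuple, shuffled), transform(shuffled)))
    return all(base[k] == after[k] for k in base)

def check_duplicate_stability(transform, row, copies=3):
    """True if duplicate copies of the same row all receive the same
    embedding within one transform call."""
    out = transform([row] * copies)
    return all(v == out[0] for v in out)

# A deterministic identity "reducer" passes both checks ...
ident = lambda rows: [tuple(r) for r in rows]
print(check_shuffle_invariance(ident, [(1,), (2,), (3,)]))  # True
print(check_duplicate_stability(ident, (1,)))               # True

# ... while a noisy one fails them, matching the observations above.
noisy = lambda rows: [(r[0] + random.random(),) for r in rows]
print(check_shuffle_invariance(noisy, [(1,), (2,), (3,)]))  # False
print(check_duplicate_stability(noisy, (1,)))               # False
```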

Top Results From Across the Web

UMAP Reproducibility — umap 0.5 documentation
UMAP is a stochastic algorithm – it makes use of randomness both to speed up approximation steps, and to aid in solving hard...

tSNE vs. UMAP: Global Structure - Towards Data Science
Being initialized with PCA or Graph Laplacian, tSNE becomes a deterministic method. In contrast, UMAP keeps its stochasticity even being ...

Intuitive explanation of how UMAP works, compared to t-SNE
With UMAP, you should be able to interpret both the distances between / positions of points and clusters. Both algorithms are highly stochastic...

UMAP Based Anomaly Detection for Minimal Residual ... - NCBI
Keywords: acute myeloid leukemia, anomaly detection, UMAP, set-transformer ... and UMAP as well as HDBSCAN are non-deterministic algorithms, ...

Understanding UMAP
It's also notable that t-SNE projections vary widely from run to run, with different pieces of the higher-dimensional data projected to different locations....
