A rough roadmap of things to be done for UMAP. Some of these tasks are easy, some are hard, and some require deeper knowledge of UMAP. Short and medium term tasks should be approachable for many people. Reply to this issue if you are interested in taking up any of them.

Short term items

  • Support for sparse matrix input
  • Add random seed as a user option
  • Support for cosine distance RP-trees
  • Allow non-RP-tree initialisation of NN-descent
  • Better document (via docstrings) all the support functions
  • “Custom” initialisation with a predefined positioning

Medium term items

  • Generate notebook for basic usage demonstration
  • Generate notebook explaining parameter options and their effects
  • Set up CI and build a basic test suite
  • Start building basic documentation and integrate with readthedocs

Longer term items

  • Generate notebook for “How UMAP works”
  • Add code (and devise API(?)) for UMAP on general pandas dataframes
  • Add support for semi-supervised dimension reduction via UMAP
  • UMAP as a generative model (code + demo)
  • UMAP for text data (similar to word2vec)
  • A transform function for new, previously unseen data (see issue #40)
  • Model persistence for UMAP models

No priority

  • GPU support for UMAP
  • Conda-forge UMAP package
  • Improve numba usage (better numba expertise required)
  • Concurrency via Dask for multicore and distributed support

Issue Analytics

  • State: open
  • Created 6 years ago
  • Reactions: 21
  • Comments: 36 (21 by maintainers)

Top GitHub Comments

bccho commented, Aug 29, 2018 (4 reactions)

Unfortunately I have no control over moving to Python 3 (as much as I would like to), but as a workaround I can try saving individual subobjects to files and re-loading them. Can you indicate which subobjects and parameters are required for transform to work correctly?

EDIT: After iterating through individual attributes from dir(trans), it looks like _random_init, _search, and _tree_init are the culprits. They are all instances of @numba.njit called on nested functions, but using dill didn’t resolve the problem, and it seems they are necessary for transform.
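The attribute-by-attribute check described above can be sketched roughly as follows. This is an illustration written for Python 3, not the commenter's actual script: `unpicklable_attrs` and the `demo` object are hypothetical names, and the sketch walks `vars(obj)` (instance attributes only) rather than everything `dir()` returns.

```python
import pickle
from types import SimpleNamespace


def unpicklable_attrs(obj):
    """Return the names of instance attributes that pickle cannot serialise."""
    bad = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception:
            # pickle raises different exception types depending on the object,
            # so catch broadly and just record the attribute name.
            bad.append(name)
    return bad


# Hypothetical stand-in for a fitted model: one plain value, one lambda.
demo = SimpleNamespace(a=1, f=lambda x: x)
culprits = unpicklable_attrs(demo)
print(culprits)
```

Running the same helper on a fitted model would list the attributes to drop before pickling and rebuild after loading.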

EDIT: Here is a functioning workaround for Python 2:

import pickle

import numpy as np
from umap import UMAP
from umap.nndescent import make_initialisations, make_initialized_nnd_search


def save_umap(umap):
    # Drop the numba-compiled closures that cannot be pickled.
    # Note: this also removes them from the object passed in.
    for attr in ["_tree_init", "_search", "_random_init"]:
        if hasattr(umap, attr):
            delattr(umap, attr)
    return pickle.dumps(umap, pickle.HIGHEST_PROTOCOL)


def load_umap(s):
    umap = pickle.loads(s)
    # Rebuild the dropped closures from the stored distance function.
    umap._random_init, umap._tree_init = make_initialisations(
        umap._distance_func, umap._dist_args
    )
    umap._search = make_initialized_nnd_search(
        umap._distance_func, umap._dist_args
    )
    return umap


X = np.random.randn(5000, 16)
X_new = np.random.randn(100, 16)

um = UMAP()
um.fit(X)
emb = um.transform(X_new)

pkl = save_umap(um)
um_new = load_umap(pkl)  # no error!

emb_new = um_new.transform(X_new)
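
The same pattern (drop the unpicklable closures when saving, rebuild them when loading) can also be expressed through pickle's `__getstate__` and `__setstate__` hooks, which leaves the live object untouched. Below is a minimal Python 3 sketch of that idea. `Model` and `make_search` are hypothetical stand-ins, not UMAP's classes or functions.

```python
import pickle


def make_search(scale):
    # A nested function, like the numba closures above: pickle cannot
    # serialise it because it is not reachable by a module-level name.
    def search(x):
        return x * scale
    return search


class Model:
    def __init__(self, scale):
        self.scale = scale
        self._search = make_search(scale)

    def __getstate__(self):
        # Copy the state and leave out the closure; the live object keeps it.
        state = self.__dict__.copy()
        state.pop("_search", None)
        return state

    def __setstate__(self, state):
        # Restore plain attributes, then rebuild the closure from them.
        self.__dict__.update(state)
        self._search = make_search(self.scale)


model = Model(3)
restored = pickle.loads(pickle.dumps(model, pickle.HIGHEST_PROTOCOL))
print(restored._search(2), model._search(2))
```

Unlike `save_umap`, which deletes attributes from the object it is given, this variant keeps the original model usable after saving.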
josephcourtney commented, Jul 20, 2018 (3 reactions)

Unless I am misunderstanding something, pickling seems to work fine, at least on the current main branch. Here is a simple example that pickles and unpickles a trained model, even with a custom metric. Note: if you unpickle a model with a custom metric, that metric must already be defined in the file doing the unpickling, because the pickle only contains a reference to the metric function, not its code.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import umap
import pickle


digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    stratify=digits.target,
    random_state=42
)


def mydist(x, y):
    return np.max(np.abs(x - y))


trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric=mydist
).fit(X_train)
plt.scatter(trans.embedding_[:, 0], trans.embedding_[:, 1], s=5, c=y_train, cmap='Spectral')
plt.title('Embedding of the training set by UMAP', fontsize=24)
plt.show()
plt.close()


with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)

with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)


test_embedding = trans.transform(X_test)
plt.scatter(test_embedding[:, 0], test_embedding[:, 1], s=5, c=y_test, cmap='Spectral')
plt.title('Embedding of the test set by UMAP', fontsize=24)
plt.show()
plt.close()
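
The note about custom metrics follows from how pickle treats functions: it records the module and name, not the function body. A small Python 3 sketch that shows this without UMAP, using a plain-Python stand-in for `mydist` (no NumPy, so it runs anywhere):

```python
import pickle


def mydist(x, y):
    # Stand-in for the custom metric above, on plain Python sequences.
    return max(abs(a - b) for a, b in zip(x, y))


payload = pickle.dumps(mydist)

# The payload carries the function's name, not its code.
name_in_payload = b"mydist" in payload

# Unpickling looks the name up again, so it hands back the very same object.
restored = pickle.loads(payload)
is_same_function = restored is mydist
print(name_in_payload, is_same_function)
```

If the loading script does not define a function with the same name in the same module, the lookup fails and `pickle.loads` raises an error.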