UMAP Roadmap


A rough roadmap of things to be done for UMAP. Some of these tasks are easy, some are hard, and some require deeper knowledge of UMAP. Short- and medium-term tasks should be approachable for many people. Reply to this issue if you are interested in taking any of them on.

Short term items

Support for sparse matrix input
Add random seed as a user option
Support for cosine distance RP-trees
Allow non-RP-tree initialisation of NN-descent
Better document (via docstrings) all the support functions
“Custom” initialisation with a predefined positioning.

Medium term items

Generate notebook for basic usage demonstration
Generate notebook explaining parameter options and their effects
Set up CI and build a basic test suite
Start building basic documentation and integrate with readthedocs

Longer term items

Generate notebook for “How UMAP works”
Add code (and devise API(?)) for UMAP on general pandas dataframes
Add support for semi-supervised dimension reduction via UMAP
UMAP as a generative model (code + demo)
UMAP for text data (similar to word2vec)
A transform function for new, previously unseen data (see issue #40)
Model persistence for UMAP models

No priority

GPU support for UMAP
Conda-forge UMAP package
Improve numba usage (better numba expertise required)
Concurrency via Dask for multicore and distributed support

Issue Analytics

State:
Created 6 years ago
Reactions: 21
Comments: 36 (21 by maintainers)

Top GitHub Comments

4 reactions

bccho commented, Aug 29, 2018

Unfortunately I have no control over moving to Python 3 (as much as I would like to), but as a workaround I can try saving individual subobjects to files and reloading them. Can you indicate which subobjects and parameters are required for transform to work correctly?

EDIT: After iterating through individual attributes from dir(trans), it looks like _random_init, _search, and _tree_init are the culprits. All three are the result of applying @numba.njit to nested functions. Using dill didn’t resolve the problem, and they seem to be necessary for transform.
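The attribute-by-attribute search described above can be sketched roughly as follows. This is a generic illustration, not UMAP code: `find_unpicklable_attrs` and the `Demo` class are hypothetical names, and the sketch walks `vars(obj)` rather than `dir(obj)` so that only instance attributes are tried.

```python
import pickle


def find_unpicklable_attrs(obj):
    """Return the names of instance attributes that fail to pickle on their own."""
    bad = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value, pickle.HIGHEST_PROTOCOL)
        except Exception:
            # Any failure here marks the attribute as a pickling culprit.
            bad.append(name)
    return bad


class Demo:
    """Hypothetical stand-in for a fitted model holding a nested function."""

    def __init__(self):
        self.weights = [1.0, 2.0]
        # A function created inside another function cannot be pickled by reference.
        self._search = lambda x: x


culprits = find_unpicklable_attrs(Demo())
print(culprits)
```

Running the same helper on a fitted model would be one way to confirm which attributes need to be removed before pickling.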

EDIT: Here is a functioning workaround for Python 2:

import pickle

def save_umap(model):
    # Drop the numba-compiled closures, which cannot be pickled.
    # Note: this modifies the model in place.
    for attr in ["_tree_init", "_search", "_random_init"]:
        if hasattr(model, attr):
            delattr(model, attr)
    return pickle.dumps(model, pickle.HIGHEST_PROTOCOL)

def load_umap(s):
    model = pickle.loads(s)
    # Rebuild the dropped closures from the stored distance function.
    from umap.nndescent import make_initialisations, make_initialized_nnd_search
    model._random_init, model._tree_init = make_initialisations(
        model._distance_func, model._dist_args
    )
    model._search = make_initialized_nnd_search(
        model._distance_func, model._dist_args
    )
    return model

import numpy as np
X = np.random.randn(5000, 16)
X_new = np.random.randn(100, 16)

from umap import UMAP
um = UMAP()
um.fit(X)
emb = um.transform(X_new)

pkl = save_umap(um)
um_new = load_umap(pkl) # no error!

emb_new = um_new.transform(X_new)
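The same strip-and-rebuild idea can also be expressed with pickle's standard `__getstate__` and `__setstate__` hooks. The sketch below uses a toy class, not UMAP itself: `Model`, `scale`, and `_make_search` are hypothetical names chosen for illustration, and the lambda merely stands in for a compiled closure.

```python
import pickle


class Model:
    """Toy model with one attribute that cannot be pickled."""

    def __init__(self, scale):
        self.scale = scale
        self._search = self._make_search()

    def _make_search(self):
        scale = self.scale
        # A closure like this cannot be pickled, much like the nested functions above.
        return lambda x: x * scale

    def __getstate__(self):
        # Copy the state and leave the unpicklable closure out of it.
        state = self.__dict__.copy()
        state.pop("_search", None)
        return state

    def __setstate__(self, state):
        # Restore the plain attributes, then rebuild the closure from them.
        self.__dict__.update(state)
        self._search = self._make_search()


restored = pickle.loads(pickle.dumps(Model(3), pickle.HIGHEST_PROTOCOL))
print(restored._search(2))
```

Unlike the workaround above, this variant leaves the original object untouched, because only a copy of its state is stripped.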

3 reactions

josephcourtney commented, Jul 20, 2018

Unless I am misunderstanding something, pickling seems to work fine, at least on the current main branch. Here is a simple example that pickles and unpickles a trained model, even with a custom metric. Note: if you unpickle a model with a custom metric, that metric must already be defined in the file doing the unpickling; the pickle only contains a reference to the metric function.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import umap
import pickle


digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    stratify=digits.target,
    random_state=42
)


def mydist(x, y):
    return np.max(np.abs(x - y))


trans = umap.UMAP(
    n_neighbors=5,
    random_state=42,
    metric=mydist
).fit(X_train)
plt.scatter(trans.embedding_[:, 0], trans.embedding_[:, 1], s=5, c=y_train, cmap='Spectral')
plt.title('Embedding of the training set by UMAP', fontsize=24)
plt.show()
plt.close()


with open('trans.pkl', 'wb') as f:
    pickle.dump(trans, f, pickle.HIGHEST_PROTOCOL)

with open('trans.pkl', 'rb') as f:
    trans = pickle.load(f)


test_embedding = trans.transform(X_test)
plt.scatter(test_embedding[:, 0], test_embedding[:, 1], s=5, c=y_test, cmap='Spectral')
plt.title('Embedding of the test set by UMAP', fontsize=24)
plt.show()
plt.close()
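The note about custom metrics follows from how pickle handles functions: it records the module and name, not the code. A minimal sketch of that behaviour, using `math.hypot` as a stand-in for a custom metric (an assumption for illustration; `model_state` is a hypothetical dictionary, not a UMAP object):

```python
import math
import pickle

# math.hypot stands in for a custom metric here (illustration only).
model_state = {"n_neighbors": 5, "metric": math.hypot}

payload = pickle.dumps(model_state, pickle.HIGHEST_PROTOCOL)

# The payload holds the module and function names, not the function's code.
names_stored = b"math" in payload and b"hypot" in payload

restored = pickle.loads(payload)
# Unpickling looks the name up again, so the very same function object comes back.
same_function = restored["metric"] is math.hypot

print(names_stored, same_function)
```

If the name cannot be found at load time, for example because the metric is not defined in the script doing the unpickling, the load fails, which is why the metric has to exist before `pickle.load` is called.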
