[Feature Request] Allow support for "precomputed" distance matrix for umap.umap_.fuzzy_simplicial_set

I’ve been working a lot with precomputed distance matrices lately. The option to use these precomputed distances in umap.umap_.fuzzy_simplicial_set would be really helpful.

Is this possible with any current hacks? If not, could this be possible to implement in future versions?

Issue Analytics

Created 2 years ago
Comments: 11 (2 by maintainers)

Top GitHub Comments

jolespin commented, Feb 8, 2022

@lmcinnes Thank you, this is an extremely useful explanation (I also didn’t know it was that easy to use @numba.njit()). I’m working on a wrapper around your fuzzy_simplicial_set to use with my code, and it’s helpful knowing how the X, knn_indices, knn_dists, and angular arguments are used. Looking forward to applying this to some microbiome and sequencing datasets.

As for the Aitchison distance, yes, doing a CLR transform followed by Euclidean is definitely the most computationally efficient way AFAIK. However, doing things like variance log-ratio or rho proportionality is less straightforward, so the @numba.njit() support will be extremely useful.

lmcinnes commented, Feb 8, 2022

If knn_indices and knn_dists are specified (and not None) then X will be ignored and the knn_indices and knn_dists will be used directly. So you can either not specify the indices and dists and provide an X (which can be a feature matrix, or, if metric="precomputed", a distance matrix), or just directly specify the indices and dists and use those.

The set_op_mix_ratio and local_connectivity are relevant for symmetrization so they will be used regardless of the choice of input. In contrast angular is about what kinds of trees to use for nearest neighbour approximation – it will only matter if you specify X as a feature matrix.
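For the precomputed route, a minimal sketch of how knn_indices and knn_dists could be derived from a full distance matrix with plain NumPy. The helper name knn_from_distance_matrix and the toy data are made up for illustration, and the convention that each point appears as its own first neighbour (distance 0) is an assumption worth checking against your UMAP version:

```python
import numpy as np

def knn_from_distance_matrix(dist, n_neighbors):
    """Return (knn_indices, knn_dists) from a square distance matrix.

    Each row lists the n_neighbors closest points in ascending distance.
    With a zero diagonal, each point is normally its own first neighbour.
    """
    dist = np.asarray(dist, dtype=np.float64)
    if dist.ndim != 2 or dist.shape[0] != dist.shape[1]:
        raise ValueError("dist must be a square matrix")
    # Stable sort so that ties are broken by index order
    order = np.argsort(dist, axis=1, kind="stable")[:, :n_neighbors]
    knn_dists = np.take_along_axis(dist, order, axis=1)
    return order, knn_dists

# Toy data (hypothetical): five points on a line
points = np.array([0.0, 1.0, 3.0, 6.0, 10.0])
D = np.abs(points[:, None] - points[None, :])

knn_indices, knn_dists = knn_from_distance_matrix(D, n_neighbors=3)
```

The two arrays could then be handed to fuzzy_simplicial_set through its knn_indices and knn_dists arguments. Passing an X with the same number of rows is still a sensible precaution, since the function may read its shape even when the distances themselves are ignored.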

Lastly, looking through all of this now, it is worth noting that the metric parameter can also be a (numba jitted) Python function specifying how to compute a distance between two vectors. Unless you have sparse data (and you can’t really have sparse data and use Aitchison distance, because of the zeros), this should be straightforward; writing distances for sparse data requires more understanding of the sparse data formats. So, for example, you could have:

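from umap.umap_ import fuzzy_simplicial_set
import numba
import numpy as np

@numba.njit()
def aitchison_distance(x, y):
    # Geometric means of x and y: accumulate the product, then take the n-th root.
    # The accumulators must start at 1.0; starting at 0.0 would zero the product.
    x_denominator = 1.0
    y_denominator = 1.0
    for i in range(x.shape[0]):
        x_denominator *= x[i]
        y_denominator *= y[i]

    x_denominator = np.power(x_denominator, 1.0 / x.shape[0])
    y_denominator = np.power(y_denominator, 1.0 / y.shape[0])

    # Euclidean distance between the log-ratios against the geometric means
    result = 0.0

    for i in range(x.shape[0]):
        x_rescaled = np.log(x[i] / x_denominator)
        y_rescaled = np.log(y[i] / y_denominator)
        result += (x_rescaled - y_rescaled)**2

    return np.sqrt(result)

# X is your (n_samples, n_features) array of strictly positive values.
# n_neighbors and random_state are required arguments; the values here are placeholders.
fuzzy_simplicial_set(X, n_neighbors=15, random_state=np.random.RandomState(42), metric=aitchison_distance)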

In practice I think I would apply a little algebra and rewrite the distance computation for greater numerical stability (taking the log of a geometric mean, for example, could be computed better), but I wanted the computation to be relatively clear. Given the nature of the distance computation, however, I think you could just as well do:

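import numpy as np

# Convert the data via the centered log-ratio (CLR) transform
log_scaled = np.log(X)
log_of_geometric_mean = np.mean(log_scaled, axis=1)
clr_X = log_scaled - log_of_geometric_mean[:, None]

# Aitchison distance is the Euclidean distance between CLR-transformed rows
# (n_neighbors and random_state are required; the values here are placeholders)
fuzzy_simplicial_set(clr_X, n_neighbors=15, random_state=np.random.RandomState(42), metric="euclidean")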
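As a sanity check on that equivalence, here is a small NumPy-only sketch comparing the Aitchison distance computed from its definition with the Euclidean distance between log-ratio-transformed rows. The function names clr and aitchison_direct and the random test data are illustrative assumptions, not part of UMAP. The geometric mean is taken in log space, which is one way to get the better numerical behaviour mentioned above:

```python
import numpy as np

def clr(X):
    """Centered log-ratio transform of rows with strictly positive entries."""
    log_X = np.log(X)
    return log_X - log_X.mean(axis=1, keepdims=True)

def aitchison_direct(x, y):
    """Aitchison distance between two positive vectors, from the definition."""
    # Geometric means computed as exp(mean(log)) to avoid a long running product
    gx = np.exp(np.mean(np.log(x)))
    gy = np.exp(np.mean(np.log(y)))
    return np.sqrt(np.sum((np.log(x / gx) - np.log(y / gy)) ** 2))

# Hypothetical compositional data: positive rows normalised to sum to 1
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(4, 6))
X /= X.sum(axis=1, keepdims=True)

clr_X = clr(X)
d_clr = np.linalg.norm(clr_X[0] - clr_X[1])   # Euclidean on transformed rows
d_direct = aitchison_direct(X[0], X[1])       # definition-based value
print(np.isclose(d_clr, d_direct))
```

Because the transform subtracts each row's mean log, rescaling a row by a positive constant leaves its transformed values unchanged, which is why the data do not need to be normalised first.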
