Passing sparse distance matrices or KNNs directly
Hey, I’m trying to do clustering on approximately 300k samples using Word Mover’s Distance (WMD) as the metric. This means the input data does not take the form of vectors, and all-pairs distance comparisons are very expensive to compute. However, I can efficiently get all-pairs (approximate) nearest neighbours using methods like the ‘centroid distance’ from the WMD paper, and populate a sparse distance matrix with the distances from each point to its nearest neighbours.
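Concretely, the sparse matrix can be assembled from the approximate-NN output roughly like this (a sketch only; knn_indices, knn_dists, n_samples and n_neighbors are placeholders for whatever the nearest-neighbour search actually returns):

import numpy as np
import scipy.sparse

n_samples, n_neighbors = 300_000, 25
# Placeholders: knn_indices[i] holds the approximate neighbour ids of sample i,
# knn_dists[i] the corresponding WMD values from the centroid-distance search
knn_indices = np.zeros((n_samples, n_neighbors), dtype=np.int64)
knn_dists = np.zeros((n_samples, n_neighbors), dtype=np.float32)

rows = np.repeat(np.arange(n_samples), n_neighbors)
cols = knn_indices.ravel()
vals = knn_dists.ravel()
sparse_distances = scipy.sparse.csr_matrix(
    (vals, (rows, cols)), shape=(n_samples, n_samples)
)
# scipy.sparse.save_npz('sparse_distances.npz', sparse_distances)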
I’d really like to try passing those sparse distances to UMAP, but I get the following error when I try:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
[1] During: typing of argument at /home/matt/miniconda3/envs/story-clustering/lib/python3.7/site-packages/umap_learn-0.4.2-py3.7.egg/umap/utils.py (29)
File "../../../miniconda3/envs/story-clustering/lib/python3.7/site-packages/umap_learn-0.4.2-py3.7.egg/umap/utils.py", line 29:
def fast_knn_indices(X, n_neighbors):
<source elided>
"""
knn_indices = np.empty((X.shape[0], n_neighbors), dtype=np.int32)
^
This error may have been caused by the following argument(s):
- argument 0: cannot determine Numba type of <class 'scipy.sparse.csr.csr_matrix'>
Is there any way I can get around this? I can compute nearest neighbours for free just by sorting rows of the sparse distance matrix, but I don’t understand the UMAP code well enough to hack my precomputed NNs in. The distance matrix is also much too big to convert to dense form - a 300,000 x 300,000 matrix of float32 values would take ~340GB of memory, so I’m kind of stuck!
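For a CSR matrix, that row-sorting looks roughly like this (a sketch; it assumes every stored entry in a row is a genuine neighbour distance, with no explicit zeros standing in for missing values):

import numpy as np

def knn_from_sparse_row(D, i, k):
    # D: CSR matrix whose stored entries are distances to (approximate) neighbours
    start, end = D.indptr[i], D.indptr[i + 1]
    cols = D.indices[start:end]    # neighbour indices stored for row i
    dists = D.data[start:end]      # corresponding distances
    order = np.argsort(dists)[:k]  # k smallest distances in this row
    return cols[order], dists[order]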
Here’s my very simple code snippet that triggers this error:
import scipy.sparse
import umap

data = scipy.sparse.load_npz('sparse_distances.npz').tocsr()
fit = umap.UMAP(metric='precomputed')
embedding = fit.fit_transform(data)
Top GitHub Comments
If you want one more thing to play with (definitely experimental, mind you), I have an implementation of linear optimal transport (with compression) in the vectorizers library. This allows for something in between centroid-based approaches and a full optimal transport computation, via a linearization of optimal transport. If you have an NxD matrix of word vectors W and a bag-of-words, TF-IDF, or similar MxN sparse matrix of documents X, then the relevant code would be along the lines of the sketch below.
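A minimal sketch, assuming the class is WassersteinVectorizer and that the word vectors are passed to fit_transform via a vectors keyword (both worth checking against the current vectorizers documentation):

import numpy as np
import scipy.sparse
import vectorizers

# Stand-ins for the data described above:
#   W: (N, D) matrix of word vectors
#   X: (M, N) sparse bag-of-words / TF-IDF matrix of documents
rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 300)).astype(np.float32)
X = scipy.sparse.random(200, 1000, density=0.05, format="csr", random_state=0)

# Assumption: fit_transform takes the word vectors via a `vectors` keyword
wass = vectorizers.WassersteinVectorizer()
doc_vectors = wass.fit_transform(X, vectors=W)  # expected shape: (M, D)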
This will produce an MxD matrix of document vectors such that cosine distances between the vectors should be correlated with Word Mover’s Distance between the documents. The quality of the correlation depends on a number of things, and there are parameters you can tune, but this would be a start. If your documents are relatively long you can use vectorizers.SinkhornVectorizer instead, which uses Sinkhorn iterations as an approximation (or entropic regularization) of optimal transport that will run a lot faster, especially as document length grows.

In my testing I found it worked okay - I don’t have exact numbers to hand, but I think in a dataset of around 10k samples, about 200 centroid NNs reliably contained most of the 25 true NNs, and almost always more than half of them.