Passing sparse distance matrices or KNNs directly

Hey, I’m trying to do clustering on approx 300k samples using the Word Mover’s Distance as a metric. This means that the input data does not take the form of vectors, and all-pairs distance comparisons are very expensive to compute. However, I can efficiently get all-pairs (approximate) nearest neighbours using methods like the ‘centroid distance’ from the paper and populate a sparse distance matrix with the distance values from all points to their nearest neighbours.

I’d really like to try passing those sparse distances to UMAP, but I get the following error when I try:

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
[1] During: typing of argument at /home/matt/miniconda3/envs/story-clustering/lib/python3.7/site-packages/umap_learn-0.4.2-py3.7.egg/umap/utils.py (29)

File "../../../miniconda3/envs/story-clustering/lib/python3.7/site-packages/umap_learn-0.4.2-py3.7.egg/umap/utils.py", line 29:
def fast_knn_indices(X, n_neighbors):
    <source elided>
    """
    knn_indices = np.empty((X.shape[0], n_neighbors), dtype=np.int32)
    ^

This error may have been caused by the following argument(s):
- argument 0: cannot determine Numba type of <class 'scipy.sparse.csr.csr_matrix'>

Is there any way I can get around this? I can compute nearest neighbours for free just by sorting rows of the sparse distance matrix, but I don’t understand the UMAP code well enough to hack my precomputed NNs in. The distance matrix is also much too big to convert to dense form - a 300,000 x 300,000 matrix of float32 values would take ~340GB of memory, so I’m kind of stuck!

Here’s my very simple code snippet that triggers this error:

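import scipy.sparse
import umap
# Load the precomputed sparse distance matrix in CSR format
data = scipy.sparse.load_npz('sparse_distances.npz').tocsr()
fit = umap.UMAP(metric='precomputed')
embedding = fit.fit_transform(data)  # raises the TypingError shown above (umap-learn 0.4.2)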
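
The row-sorting idea mentioned above can be sketched as follows. This is a minimal illustration, not UMAP's own code: the function name `knn_from_sparse_distances` is hypothetical, and it assumes that only explicitly stored entries of the CSR matrix are valid neighbour candidates. Rows with fewer stored entries than `n_neighbors` are padded with -1 (index) and inf (distance). How to hand these arrays to UMAP depends on the installed version and is not shown here.

```python
import numpy as np
import scipy.sparse as sp


def knn_from_sparse_distances(dist, n_neighbors):
    """Return (indices, distances), each of shape (n_samples, n_neighbors).

    Hypothetical helper: only stored entries of each row are considered.
    Rows with too few stored entries are padded with -1 / inf.
    """
    dist = sp.csr_matrix(dist)
    n_samples = dist.shape[0]
    knn_indices = np.full((n_samples, n_neighbors), -1, dtype=np.int32)
    knn_dists = np.full((n_samples, n_neighbors), np.inf, dtype=np.float32)
    for i in range(n_samples):
        # Slice out the stored column indices and distances of row i
        start, end = dist.indptr[i], dist.indptr[i + 1]
        cols = dist.indices[start:end]
        vals = dist.data[start:end]
        # Sort this row's stored distances and keep the smallest ones
        order = np.argsort(vals, kind="stable")[:n_neighbors]
        knn_indices[i, : len(order)] = cols[order]
        knn_dists[i, : len(order)] = vals[order]
    return knn_indices, knn_dists
```

This keeps memory proportional to the number of stored distances. For scale, a dense 300,000 x 300,000 float32 matrix needs 300,000² × 4 bytes = 3.6 × 10^11 bytes, roughly 335 GiB, which is consistent with the estimate above.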

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 14 (11 by maintainers)

Top GitHub Comments

1 reaction
lmcinnes commented, Sep 9, 2021

If you want one more thing to play with (definitely experimental, mind you), I have an implementation of linear optimal transport (with compression) in the vectorizers library. This allows for something in between centroid-based approaches and a full optimal transport computation, via a linearization of optimal transport. If you have an NxD matrix of word vectors W and a bag-of-words, TF-IDF, or similar MxN sparse matrix of documents X, then the relevant code would be along the lines of:

import vectorizers
lot_vectors = vectorizers.WassersteinVectorizer().fit_transform(X, vectors=W)

This will produce an MxD matrix of document vectors such that cosine distances between the vectors should be correlated with the Word Mover’s Distance between the documents. The quality of the correlation depends on a number of things, and there are parameters you can tune, but this would be a start. If your documents are relatively long you can use vectorizers.SinkhornVectorizer instead, which uses Sinkhorn iterations as an approximation (or entropic regularization) of optimal transport that will run a lot faster, especially as document length grows.
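
Once document vectors like these exist, cosine nearest neighbours can be computed without ever forming the full distance matrix. The sketch below is an illustration only: `cosine_knn` is a hypothetical helper, it performs an exact brute-force search in row chunks, and it assumes `n_neighbors` is smaller than the number of samples. At 300k samples an approximate nearest-neighbour index would likely be the more practical choice.

```python
import numpy as np


def cosine_knn(vectors, n_neighbors, chunk_size=1024):
    """Exact top-k cosine neighbours, computed chunk by chunk to bound memory."""
    v = np.asarray(vectors, dtype=np.float32)
    # Normalise rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    unit = v / np.maximum(norms, 1e-12)
    n = unit.shape[0]
    knn_indices = np.empty((n, n_neighbors), dtype=np.int64)
    knn_dists = np.empty((n, n_neighbors), dtype=np.float32)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Cosine distance from this chunk of rows to every sample
        dists = 1.0 - unit[start:stop] @ unit.T
        # Exclude each sample from its own neighbour list
        dists[np.arange(stop - start), np.arange(start, stop)] = np.inf
        # Select the k smallest per row, then sort just those k
        part = np.argpartition(dists, n_neighbors - 1, axis=1)[:, :n_neighbors]
        part_d = np.take_along_axis(dists, part, axis=1)
        order = np.argsort(part_d, axis=1)
        knn_indices[start:stop] = np.take_along_axis(part, order, axis=1)
        knn_dists[start:stop] = np.take_along_axis(part_d, order, axis=1)
    return knn_indices, knn_dists
```

Only one `chunk_size` x M block of distances is held in memory at a time, so the peak footprint is set by `chunk_size` rather than by M².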

1 reaction
Rocketknight1 commented, Sep 8, 2021

In my testing I found it worked okay - I don’t have exact numbers to hand, but I think in a dataset of around 10k samples, about 200 centroid NNs reliably contained most of the 25 true NNs, and almost always more than half of them.
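
A check of this kind can be sketched as a per-sample recall computation. This is an assumed reconstruction of the measurement, not the code actually used: `candidate_recall` is a hypothetical helper that takes, for each sample, the candidate neighbours (for example 200 centroid-distance NNs) and the true neighbours (for example 25 exact NNs), and returns the fraction of true neighbours present among the candidates.

```python
import numpy as np


def candidate_recall(candidates, true_neighbors):
    """Per-sample fraction of true neighbours found among the candidates."""
    recalls = np.empty(len(true_neighbors), dtype=np.float64)
    for i, (cand, true) in enumerate(zip(candidates, true_neighbors)):
        true_set = {int(t) for t in true}
        # Count how many true neighbours appear in the candidate list
        hits = len(true_set.intersection(int(c) for c in cand))
        recalls[i] = hits / len(true_set)
    return recalls
```

Calling `candidate_recall(candidates, true_neighbors).mean()` gives the average recall, and `(recalls > 0.5).mean()` gives the share of samples for which more than half of the true neighbours were recovered.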
