Passing sparse distance matrices or KNNs directly
Hey, I’m trying to do clustering on approximately 300k samples using Word Mover’s Distance (WMD) as the metric. This means the input data does not take the form of vectors, and all-pairs distance comparisons are very expensive to compute. However, I can efficiently get all-pairs (approximate) nearest neighbours using methods like the ‘centroid distance’ from the WMD paper, and populate a sparse distance matrix with the distances from each point to its nearest neighbours.
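Concretely, the sparse matrix can be assembled from the approximate-NN output roughly like this (a sketch only; knn_indices, knn_dists, n_samples and n_neighbors are placeholders for whatever the nearest-neighbour search actually returns):

import numpy as np
import scipy.sparse

n_samples, n_neighbors = 300_000, 25
# Placeholders: knn_indices[i] holds the approximate neighbour ids of sample i,
# knn_dists[i] the corresponding WMD values from the centroid-distance search
knn_indices = np.zeros((n_samples, n_neighbors), dtype=np.int64)
knn_dists = np.zeros((n_samples, n_neighbors), dtype=np.float32)

rows = np.repeat(np.arange(n_samples), n_neighbors)
cols = knn_indices.ravel()
vals = knn_dists.ravel()
sparse_distances = scipy.sparse.csr_matrix(
    (vals, (rows, cols)), shape=(n_samples, n_samples)
)
# scipy.sparse.save_npz('sparse_distances.npz', sparse_distances)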
I’d really like to try passing those sparse distances to UMAP, but I get the following error when I try:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
[1] During: typing of argument at /home/matt/miniconda3/envs/story-clustering/lib/python3.7/site-packages/umap_learn-0.4.2-py3.7.egg/umap/utils.py (29)
File "../../../miniconda3/envs/story-clustering/lib/python3.7/site-packages/umap_learn-0.4.2-py3.7.egg/umap/utils.py", line 29:
def fast_knn_indices(X, n_neighbors):
<source elided>
"""
knn_indices = np.empty((X.shape[0], n_neighbors), dtype=np.int32)
^
This error may have been caused by the following argument(s):
- argument 0: cannot determine Numba type of <class 'scipy.sparse.csr.csr_matrix'>
Is there any way I can get around this? I can compute nearest neighbours for free just by sorting rows of the sparse distance matrix, but I don’t understand the UMAP code well enough to hack my precomputed NNs in. The distance matrix is also much too big to convert to dense form - a 300,000 x 300,000 matrix of float32 values would take ~340GB of memory, so I’m kind of stuck!
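For a CSR matrix, that row-sorting looks roughly like this (a sketch; it assumes every stored entry in a row is a genuine neighbour distance, with no explicit zeros standing in for missing values):

import numpy as np

def knn_from_sparse_row(D, i, k):
    # D: CSR matrix whose stored entries are distances to (approximate) neighbours
    start, end = D.indptr[i], D.indptr[i + 1]
    cols = D.indices[start:end]    # neighbour indices stored for row i
    dists = D.data[start:end]      # corresponding distances
    order = np.argsort(dists)[:k]  # k smallest distances in this row
    return cols[order], dists[order]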
Here’s my very simple code snippet that triggers this error:
import scipy.sparse
import umap

data = scipy.sparse.load_npz('sparse_distances.npz').tocsr()
fit = umap.UMAP(metric='precomputed')
embedding = fit.fit_transform(data)
Top GitHub Comments
If you want one more thing to play with (definitely experimental, mind you), I have an implementation of linear optimal transport (with compression) in the vectorizers library. This allows for something in between centroid-based approaches and a full optimal transport computation, via a linearization of optimal transport. If you have an NxD matrix of word vectors W and a bag-of-words, TF-IDF, or similar MxN sparse matrix of documents X, then the relevant code would be along the lines of the sketch below.
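A minimal sketch, assuming the class is WassersteinVectorizer and that the word vectors are passed to fit_transform via a vectors keyword (both worth checking against the current vectorizers documentation):

import numpy as np
import scipy.sparse
import vectorizers

# Stand-ins for the data described above:
#   W: (N, D) matrix of word vectors
#   X: (M, N) sparse bag-of-words / TF-IDF matrix of documents
rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 300)).astype(np.float32)
X = scipy.sparse.random(200, 1000, density=0.05, format="csr", random_state=0)

# Assumption: fit_transform takes the word vectors via a `vectors` keyword
wass = vectorizers.WassersteinVectorizer()
doc_vectors = wass.fit_transform(X, vectors=W)  # expected shape: (M, D)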
This will produce an MxD matrix of document vectors such that cosine distances between the vectors should be correlated with Word Mover’s Distance between the documents. The quality of the correlation depends on a number of things, and there are parameters you can tune, but this would be a start. If your documents are relatively long you can use vectorizers.SinkhornVectorizer instead, which uses Sinkhorn iterations as an approximation (or entropic regularization) of optimal transport that will run a lot faster, especially as document length grows.

In my testing I found it worked okay - I don’t have exact numbers to hand, but I think in a dataset of around 10k samples, about 200 centroid NNs reliably contained most of the 25 true NNs, and almost always more than half of them.