Initialize queryable index with init_graph?
I would like to be able to more quickly construct an NNDescent object from a precomputed distance matrix, and potentially RP Forest. This is sorta possible right now with the init_graph argument, but the NNDescent object constructed cannot be queried since there’s no RP Forest (#103).
Example of this failing
import pynndescent
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
train, test = train_test_split(make_blobs(10_000)[0])
from_scratch = pynndescent.NNDescent(train, n_neighbors=15)
indices, _ = from_scratch._neighbor_graph
from_scratch.query(test)  # works
from_init = pynndescent.NNDescent(train, n_neighbors=15, init_graph=indices)
from_init.query(test)  # raises the TypeError shown in the traceback below
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-15-d1ceea2a4663> in <module>
----> 1 from_init.query(test)
~/github/pynndescent/pynndescent/pynndescent_.py in query(self, query_data, k, epsilon)
1585 """
1586 if not hasattr(self, "_search_graph"):
-> 1587 self._init_search_graph()
1588
1589 if not self._is_sparse:
~/github/pynndescent/pynndescent/pynndescent_.py in _init_search_graph(self)
953
954 if not hasattr(self, "_search_forest"):
--> 955 tree_scores = [
956 score_linked_tree(tree, self._neighbor_graph[0])
957 for tree in self._rp_forest
TypeError: 'NoneType' object is not iterable
I think this would be fairly straightforward to make work. Here is a rough proof of concept:
Hacky example of this working
import numpy as np  # needed for the np.testing checks at the end
from_scratch = pynndescent.NNDescent(train, n_neighbors=15)
indices, _ = from_scratch._neighbor_graph
rp_forest = from_scratch._rp_forest
query_indices_scratch, query_distances_scratch = from_scratch.query(test)
from_init = pynndescent.NNDescent(train, n_neighbors=15, init_graph=indices)
from_init._rp_forest = rp_forest
query_indices_init, query_distances_init = from_init.query(test)  # query the index built from init_graph
# I think they won't always be exactly the same at the moment, but I think this could be addressed
np.testing.assert_allclose(query_indices_scratch, query_indices_init)
np.testing.assert_allclose(query_distances_scratch, query_distances_init)
Use case
The use case is in single-cell analysis, where we store the neighbor graph for our dataset (as we use it for multiple purposes) and want to be able to speed up querying the dataset using reconstructed graphs.
Construction benchmark
Using fmnist data from ann benchmark
%time pynndescent.NNDescent(fmnist_train)
# CPU times: user 1min 3s, sys: 813 ms, total: 1min 4s
# Wall time: 5.92 s
# indices taken from ^
%time pynndescent.NNDescent(fmnist_train, init_graph=indices)
# CPU times: user 10.5 s, sys: 49.2 ms, total: 10.5 s
# Wall time: 1.4 s
Questions
I’m not sure how much validation should or could be done to make sure initialization values are valid. To some extent, I think it’s also alright to say “user beware” here. Some thoughts on checks that could be made:
- Of course, the graph, forest, and dataset must have related sizes.
- The graph should have the correct k.
- If distances could be passed as well as indices, their values could be verified (also verifying the metric).
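The checks listed above could be sketched roughly as follows. This is only an illustration, not part of pynndescent: the function names `check_init_graph` and `check_distances` are hypothetical, the distance check assumes a Euclidean metric, and negative indices are rejected here even though a real implementation might want to allow a sentinel value for missing neighbors.

```python
import numpy as np


def check_init_graph(data, init_graph, n_neighbors):
    """Hypothetical sanity checks for a user-supplied initial graph."""
    data = np.asarray(data)
    init_graph = np.asarray(init_graph)
    n_samples = data.shape[0]

    if init_graph.ndim != 2:
        raise ValueError("init_graph must be a 2D array")
    if init_graph.shape[0] != n_samples:
        raise ValueError("init_graph must have one row per data sample")
    if init_graph.shape[1] != n_neighbors:
        raise ValueError("init_graph must have n_neighbors columns")
    if not np.issubdtype(init_graph.dtype, np.integer):
        raise ValueError("init_graph must hold integer indices")
    if init_graph.min() < 0 or init_graph.max() >= n_samples:
        raise ValueError("init_graph contains out-of-range indices")


def check_distances(data, init_graph, init_dist, rtol=1e-5):
    """Verify supplied distances, assuming a Euclidean metric."""
    data = np.asarray(data, dtype=float)
    # Shape (n_samples, n_neighbors, n_features): each point minus its neighbors.
    diffs = data[:, None, :] - data[np.asarray(init_graph)]
    expected = np.sqrt((diffs ** 2).sum(axis=-1))
    if not np.allclose(expected, init_dist, rtol=rtol):
        raise ValueError("supplied distances do not match the metric")
```

Both functions return nothing on success and raise `ValueError` otherwise, so they could be skipped entirely under a "user beware" policy.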
Could values like rp_forest, _search_function, be created on demand and cached via a property? That might make state easier to manage, and I believe all necessary parameters are already being stored in the NNDescent object.
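As a minimal sketch of the "create on demand and cache" idea, the standard library's `functools.cached_property` (Python 3.8+) does this directly. The `LazyIndex` class and its placeholder forest are hypothetical stand-ins, not the actual NNDescent internals.

```python
from functools import cached_property


class LazyIndex:
    """Toy stand-in for an index; not the real NNDescent class."""

    def __init__(self, data, n_trees=2):
        self.data = data
        self.n_trees = n_trees
        self.build_count = 0  # counts how many times the forest is built

    @cached_property
    def rp_forest(self):
        # Placeholder for an expensive build step; runs on first access only.
        self.build_count += 1
        return [f"tree-{i}" for i in range(self.n_trees)]


index = LazyIndex(data=[[0.0, 1.0], [1.0, 0.0]])
first = index.rp_forest   # builds the forest
second = index.rp_forest  # served from the cache, no rebuild
```

Because `cached_property` stores its result in the instance `__dict__`, assigning `index.rp_forest = precomputed` before the first access would skip the build, which is close to the "pass in a precomputed forest" case discussed above.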
This is more of an extension, but does the init_graph have to have the correct value of k? For instance, it would ostensibly speed up index creation even if you could only provide an initial graph for a smaller value of k.
(ping @Koncopd)
Top GitHub Comments
Ah, so exactly what the output looks like! That’s handy.
As you have no doubt noticed by now, I have a problem of saying I’ll contribute PRs and then not getting to them, so I will demur this time. 😂
@jamestwebber : The docs are definitely lacking on that front. In reality I hacked something in for my own needs for some experiments, and never got around to making a user-friendly version. The short answer is that, right now, it takes a 2D array of shape (n_samples, n_neighbors) such that the entry at (i, j) gives the index of the jth neighbour of the ith data sample. I would happily accept a PR for documentation on that, or perhaps even a more user-friendly version (accepting, say, sparse adjacency matrices and NetworkX graphs as well maybe?), if you cared to try.
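To make the expected layout concrete, here is a small sketch that derives an array of that shape from a precomputed dense distance matrix. The helper name `knn_indices_from_distances` is hypothetical, and the example assumes that each sample counting as its own nearest neighbour (distance 0) is acceptable for an initial graph.

```python
import numpy as np


def knn_indices_from_distances(distances, n_neighbors):
    """Return an (n_samples, n_neighbors) array of neighbor indices."""
    distances = np.asarray(distances)
    # Sort each row by distance; a stable sort keeps ties in index order.
    order = np.argsort(distances, axis=1, kind="stable")
    return order[:, :n_neighbors]


points = np.array([[0.0], [1.0], [3.0], [7.0]])
distances = np.abs(points - points.T)  # pairwise distances for 1D points
init_graph = knn_indices_from_distances(distances, n_neighbors=2)
# Row i lists the indices of the 2 samples closest to sample i.
```

An array like this is what would then be passed as `init_graph=init_graph` alongside a matching `n_neighbors`.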
@jlmelville : Wow, sorry, this totally slipped through the cracks on me! I apologise for not getting back to you. I think the major thing is that the _search_graph is constructed from the neighbor_graph_; it is really just a pruned version of a symmetrized (i.e. undirected) version of the neighbor_graph_. On the other hand the tree to initialize the search is pretty useful because it ensures that we start from a “decent” place: perhaps not ideal, but a lot less likely to run into local minima far from the query point. In my experience it mostly helps with speed of convergence though, and not accuracy.
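For readers unfamiliar with the term, symmetrizing a k-nearest-neighbour graph just means adding the reverse of every edge. The sketch below shows only that step in plain Python; the `symmetrize` helper is hypothetical, and the pruning that the real search graph applies afterwards is deliberately left out.

```python
def symmetrize(neighbor_indices):
    """Turn a directed k-NN index list into an undirected adjacency map."""
    adjacency = {i: set() for i in range(len(neighbor_indices))}
    for i, row in enumerate(neighbor_indices):
        for j in row:
            if j == i:
                continue  # skip self-loops
            adjacency[i].add(j)
            adjacency[j].add(i)  # add the reverse edge
    return adjacency


knn = [[0, 1], [1, 0], [2, 1], [3, 2]]
undirected = symmetrize(knn)
```

In this toy graph, sample 1 gains sample 2 as a neighbour only through the reverse edge, since 2 lists 1 but not the other way round.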