Initialize queryable index with init_graph?


I would like to be able to more quickly construct an NNDescent object from a precomputed distance matrix, and potentially an RP forest. This is sorta possible right now with the init_graph argument, but the resulting NNDescent object cannot be queried since it has no RP forest (#103).

Example of this failing
import pynndescent
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

train, test = train_test_split(make_blobs(10_000)[0])

from_scratch = pynndescent.NNDescent(train, n_neighbors=15)
indices, _ = from_scratch._neighbor_graph
from_scratch.query(test)  # works

from_init = pynndescent.NNDescent(train, n_neighbors=15, init_graph=indices)
from_init.query(test)  # fails: no RP forest was built
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-d1ceea2a4663> in <module>
----> 1 from_init.query(test)

~/github/pynndescent/pynndescent/pynndescent_.py in query(self, query_data, k, epsilon)
   1585         """
   1586         if not hasattr(self, "_search_graph"):
-> 1587             self._init_search_graph()
   1588 
   1589         if not self._is_sparse:

~/github/pynndescent/pynndescent/pynndescent_.py in _init_search_graph(self)
    953 
    954         if not hasattr(self, "_search_forest"):
--> 955             tree_scores = [
    956                 score_linked_tree(tree, self._neighbor_graph[0])
    957                 for tree in self._rp_forest

TypeError: 'NoneType' object is not iterable

I think this would be fairly straightforward to make work. Here is a rough proof of concept:

Hacky example of this working
import numpy as np

from_scratch = pynndescent.NNDescent(train, n_neighbors=15)
indices, _ = from_scratch._neighbor_graph
rp_forest = from_scratch._rp_forest
query_indices_scratch, query_distances_scratch = from_scratch.query(test)

from_init = pynndescent.NNDescent(train, n_neighbors=15, init_graph=indices)
from_init._rp_forest = rp_forest
query_indices_init, query_distances_init = from_init.query(test)

# I think they won't always be exactly the same at the moment, but I think this could be addressed
np.testing.assert_allclose(query_indices_scratch, query_indices_init)
np.testing.assert_allclose(query_distances_scratch, query_distances_init)

Use case

The use case is in single cell analysis, where we store the neighbor graph for our dataset (as we use it for multiple purposes) and want to be able to speed up querying the dataset using reconstructed graphs.

Construction benchmark

Using fmnist data from ann benchmark

%time pynndescent.NNDescent(fmnist_train)
# CPU times: user 1min 3s, sys: 813 ms, total: 1min 4s
# Wall time: 5.92 s

# indices taken from ^
%time pynndescent.NNDescent(fmnist_train, init_graph=indices)
# CPU times: user 10.5 s, sys: 49.2 ms, total: 10.5 s
# Wall time: 1.4 s

Questions

I’m not sure how much validation should/could be done to make sure initialization values are valid. To some extent, I think it’s also alright to say “user beware” here. Some thoughts on checks that could be made:

  • Of course, the graph, forest, and dataset must have related sizes.
  • The graph should have the correct k.
  • If distances could be passed as well as indices, their values could be verified (also verifying the metric).
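
The first two checks above could look roughly like the sketch below. `validate_init_graph` is a hypothetical helper written for illustration, not part of pynndescent, and it only covers shape and index-range checks (nothing about distances or the metric).

```python
import numpy as np


def validate_init_graph(data, init_graph, n_neighbors):
    """Hypothetical helper: sanity-check an initial kNN index array.

    Not part of pynndescent. Only checks sizes and index ranges.
    """
    data = np.asarray(data)
    init_graph = np.asarray(init_graph)

    if init_graph.ndim != 2:
        raise ValueError("init_graph must be a 2D array")
    # The graph and the dataset must have related sizes.
    if init_graph.shape[0] != data.shape[0]:
        raise ValueError("init_graph must have one row per data sample")
    # The graph should have the correct k.
    if init_graph.shape[1] != n_neighbors:
        raise ValueError("init_graph must have n_neighbors columns")
    if not np.issubdtype(init_graph.dtype, np.integer):
        raise ValueError("init_graph must contain integer indices")
    # Every entry must point at an existing sample.
    if init_graph.min() < 0 or init_graph.max() >= data.shape[0]:
        raise ValueError("init_graph contains out-of-range indices")
    return init_graph
```

Something like this could run before the graph is handed to the constructor, so a mismatch fails early with a readable message.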

Could values like rp_forest and _search_function be created on demand and cached via a property? That might make state easier to manage, and I believe all necessary parameters are already being stored in the NNDescent object.
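
To make the idea concrete, here is a minimal sketch of the on-demand pattern using `functools.cached_property`. `LazyIndex` is a toy stand-in invented for this example and is not how NNDescent is actually structured; the "forest" is just a placeholder list.

```python
from functools import cached_property


class LazyIndex:
    """Toy stand-in for an index object; not pynndescent's NNDescent."""

    def __init__(self, data, n_trees=4):
        self.data = data
        self.n_trees = n_trees
        self.build_count = 0  # counts how often the forest gets built

    @cached_property
    def rp_forest(self):
        # Placeholder for an expensive build; runs on first access only.
        self.build_count += 1
        return [f"tree-{i}" for i in range(self.n_trees)]


index = LazyIndex(data=[[0.0, 1.0], [1.0, 0.0]])
index.rp_forest  # first access builds the forest
index.rp_forest  # second access reuses the cached value

# A precomputed forest can be injected before first access,
# in which case the build step never runs.
injected = LazyIndex(data=[[0.0, 1.0], [1.0, 0.0]])
injected.rp_forest = ["precomputed-tree"]
```

With this pattern, an object built from an initial graph would simply build its forest the first time a query needs it, unless one was supplied.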

This is more of an extension, but does the init_graph have to have the correct value of k? For instance, it would ostensibly speed up index creation even if you could only provide an initial graph for a smaller value of k.

(ping @Koncopd)

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
jamestwebber commented, Sep 15, 2021

> The short answer is that, right now, it takes a 2D array of shape (n_samples, n_neighbors) such that the entry at (i, j) gives the index of the jth neighbour of the ith data sample.

Ah, so exactly what the output looks like! That’s handy.

> I would happily accept a PR for documentation on that, or perhaps even a user friendlier version (accepting, say, sparse adjacency matrices and networkX graphs as well maybe?), if you cared to try.

As you have no doubt noticed by now, I have a problem of saying I’ll contribute PRs and then not getting to them, so I will demur this time. 😂

0 reactions
lmcinnes commented, Sep 15, 2021

@jamestwebber : The docs are definitely lacking on that front. In reality I hacked something in for my own needs for some experiments, and never got around to making a user friendly version. The short answer is that, right now, it takes a 2D array of shape (n_samples, n_neighbors) such that the entry at (i, j) gives the index of the jth neighbour of the ith data sample. I would happily accept a PR for documentation on that, or perhaps even a user friendlier version (accepting, say, sparse adjacency matrices and networkX graphs as well maybe?), if you cared to try.
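
For anyone storing their graph as a sparse matrix, a conversion to that (n_samples, n_neighbors) layout might look like the sketch below. `knn_indices_from_sparse` is a hypothetical helper, not a pynndescent function. It assumes a SciPy sparse distance matrix in which every row stores exactly n_neighbors entries, and it orders each row by increasing distance, which is an assumption about what "jth neighbour" should mean.

```python
import numpy as np
from scipy import sparse


def knn_indices_from_sparse(distances, n_neighbors):
    """Hypothetical helper: sparse kNN distances -> (n_samples, n_neighbors) indices.

    Assumes every row stores exactly n_neighbors entries.
    """
    csr = sparse.csr_matrix(distances)
    n_samples = csr.shape[0]

    # Number of stored entries per row.
    counts = np.diff(csr.indptr)
    if not np.all(counts == n_neighbors):
        raise ValueError("every row must store exactly n_neighbors entries")

    neighbor_ids = csr.indices.reshape(n_samples, n_neighbors)
    neighbor_dists = csr.data.reshape(n_samples, n_neighbors)

    # Order each row's neighbours from nearest to farthest.
    order = np.argsort(neighbor_dists, axis=1, kind="stable")
    return np.take_along_axis(neighbor_ids, order, axis=1)


# Tiny example: 3 samples, 2 neighbours each.
rows = np.array([0, 0, 1, 1, 2, 2])
cols = np.array([1, 2, 0, 2, 0, 1])
vals = np.array([2.0, 1.0, 1.0, 3.0, 0.5, 4.0])
example = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))
example_indices = knn_indices_from_sparse(example, n_neighbors=2)
```

The resulting array has the shape described above; whether it is a good enough starting graph still depends on how it was computed.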

@jlmelville : Wow, sorry, this totally slipped through the cracks on me! I apologise for not getting back to you. I think the major thing is that the _search_graph is constructed from the neighbor_graph_; it is really just a pruned version of a symmetrized (i.e. undirected) version of the neighbor_graph_. On the other hand the tree to initialize the search is pretty useful because it ensures that we start from a “decent” place – perhaps not ideal, but a lot less likely to run into local minima far from the query point. In my experience it mostly helps with speed of convergence though, and not accuracy.
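
As a rough illustration of the "symmetrized" part only, the sketch below turns a directed kNN index array into an undirected adjacency matrix. `symmetrize_knn` is a hypothetical helper; it does not reproduce pynndescent's pruning step or its internal data structures.

```python
import numpy as np
from scipy import sparse


def symmetrize_knn(indices):
    """Hypothetical helper: directed kNN index array -> undirected adjacency.

    Illustrates symmetrization only; no pruning is performed.
    """
    indices = np.asarray(indices)
    n_samples, n_neighbors = indices.shape

    # One directed edge i -> indices[i, j] for every entry.
    rows = np.repeat(np.arange(n_samples), n_neighbors)
    cols = indices.ravel()
    directed = sparse.csr_matrix(
        (np.ones(rows.size, dtype=np.int8), (rows, cols)),
        shape=(n_samples, n_samples),
    )

    # Keep an edge if it exists in either direction.
    undirected = (directed + directed.T).tocsr()
    undirected.data[:] = 1
    return undirected


# 3 samples, 1 neighbour each: 0 -> 1, 1 -> 2, 2 -> 1
adjacency = symmetrize_knn([[1], [2], [1]])
```

In this toy example sample 1 ends up connected to both 0 and 2, even though its own neighbour list only contained 2.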
