BallTree query match time is O(n) not O(log(n))
I've run a performance analysis of nearest-neighbor matching with BallTree (the same happens with KDTree), and the match time grows linearly with the number of elements, when it should be O(log(n)).
Here are the benchmark results:

num_elements    match_time
10000           0.09097146987915039
20000           0.18194293975830078
40000           0.3668830394744873
80000           0.7527577877044678
Here is my code:
from sklearn.neighbors import BallTree
import numpy as np
import time

def tree_perf(tree_size):
    # Random 512-dimensional data: tree_size indexed points, plus a
    # single-point query set and a 10-point query set.
    X = np.random.rand(tree_size, 512)
    Y1 = np.random.rand(1, 512)
    Y2 = np.random.rand(10, 512)

    # Time the tree construction.
    ts = time.time()
    tree = BallTree(X, leaf_size=30, metric='euclidean')
    load_tree = time.time() - ts

    num_nn = 1

    # Time a 1-NN query for a single point.
    ts = time.time()
    vs = tree.query(Y1, k=num_nn, return_distance=True)
    match1 = time.time() - ts

    # Time a 1-NN query for a batch of 10 points.
    ts = time.time()
    vs = tree.query(Y2, k=num_nn, return_distance=True)
    match10 = time.time() - ts

    print(tree_size, load_tree, match1, match10)

print("num_elements", "load_tree", "match_1", "match_10")
for i in range(100):
    tree_size = 10000 + i * 10000
    tree_perf(tree_size)
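As a rough check (not part of the original report), the scaling exponent can be estimated from the four timings above with a log-log fit; this is just a sketch using NumPy's polyfit. A slope near 1 indicates linear scaling, while logarithmic behavior would give a slope close to 0.

import numpy as np

# Measurements reported above: (num_elements, match_time).
n = np.array([10000, 20000, 40000, 80000])
t = np.array([0.09097146987915039, 0.18194293975830078,
              0.3668830394744873, 0.7527577877044678])

# Fit log(t) = slope * log(n) + intercept; the slope estimates the
# scaling exponent. On these numbers it comes out close to 1, i.e. O(n).
slope, intercept = np.polyfit(np.log(n), np.log(t), 1)
print(f"estimated exponent: {slope:.2f}")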
Top GitHub Comments
@simsim314 I think it’s supposed to be better than KD trees when the dimension is “high” but not very high. For very high dimensions, I don’t know if an exact nearest neighbors algorithm can escape the curse of dimensionality and maintain logarithmic performance. I am aware that the documentation you quoted says “very high dimensions”, but that might be misleading.
On another note, that quote does say "on highly structured data", and this thread has so far been limited to random, uniformly distributed data. The next paragraph in the documentation reiterates the same point.
I'm still not convinced it's a bug. When you think about how a ball tree works, it makes sense. For example, suppose we have a query point q and left and right child nodes, N1 and N2. The nodes determine intervals (l1, u1) and (l2, u2) such that the distance |q-x| must lie in the corresponding interval for all x in the corresponding node. When the dimension is very high, there is very likely some overlap between the intervals (l1, u1) and (l2, u2), due to the concentration-of-distance phenomenon, which means both nodes N1 and N2 have to be searched.
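A quick way to see this concentration effect on random uniform data (the setting used in this thread) is to measure the relative contrast between the farthest and nearest distances from a query point. The snippet below is an illustrative sketch, not scikit-learn internals; as the dimension grows the contrast collapses, so the distance intervals of sibling nodes almost always overlap:

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 16, 128, 512):
    # 10000 random points and one random query in dimension d.
    X = rng.random((10000, d))
    q = rng.random(d)
    dist = np.linalg.norm(X - q, axis=1)
    # Relative contrast (d_max - d_min) / d_min shrinks as d grows:
    # all points become nearly equidistant from the query, so pruning
    # bounds stop excluding subtrees.
    print(d, (dist.max() - dist.min()) / dist.min())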
No, it's similar but different. He is talking about tree load time, while my report is about query time. His issue is also reproducible only on a single data case, whereas my issue is reproducible on random data, i.e. in most cases.