BallTree query match time is O(n) not O(log(n))

I ran a performance analysis of nearest-neighbor matching with BallTree (KDTree behaves the same), and the matching time grows linearly with the number of elements, whereas it should be O(log(n)).

Here are the benchmark results:

num_elements  match_time
10000         0.09097146987915039
20000         0.18194293975830078
40000         0.3668830394744873
80000         0.7527577877044678
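As a rough check of the "linear" claim, the four timings above can be fitted on a log-log scale. This is only a sketch: the helper name `loglog_slope` is made up for illustration, and it uses the standard library alone. A fitted exponent near 1 points to linear growth, while logarithmic growth would give an exponent close to 0.

```python
import math

# Timings copied from the benchmark above: (num_elements, match_time in seconds).
timings = [
    (10000, 0.09097146987915039),
    (20000, 0.18194293975830078),
    (40000, 0.3668830394744873),
    (80000, 0.7527577877044678),
]


def loglog_slope(points):
    """Least-squares slope of log(time) against log(n) (hypothetical helper)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


slope = loglog_slope(timings)
print(f"fitted exponent: {slope:.2f}")
```

With only four data points this is an indication rather than a proof, but each doubling of num_elements roughly doubles match_time, which is what an exponent close to 1 reflects.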

Here is my code:


from sklearn.neighbors import BallTree
import numpy as np
import time


def tree_perf(tree_size):
    # Random uniform data in 512 dimensions: the indexed set and two query batches.
    X = np.random.rand(tree_size, 512)
    Y1 = np.random.rand(1, 512)
    Y2 = np.random.rand(10, 512)

    # Time the tree construction.
    ts = time.time()
    kdt = BallTree(X, leaf_size=30, metric='euclidean')
    load_tree = time.time() - ts

    num_nn = 1
    # Time a query with a single point.
    ts = time.time()
    vs = kdt.query(Y1, k=num_nn, return_distance=True)
    match1 = time.time() - ts
    # Time a query with a batch of 10 points.
    ts = time.time()
    vs = kdt.query(Y2, k=num_nn, return_distance=True)
    match10 = time.time() - ts
    print(tree_size, load_tree, match1, match10)

print("num_elements", "load_tree", "match_1", "match_10")

# Tree sizes from 10,000 to 1,000,000 in steps of 10,000.
for i in range(100):
    tree_size = 10000 + i * 10000
    tree_perf(tree_size)

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 18 (7 by maintainers)

Top GitHub Comments

1 reaction
gkaranikas commented, Feb 10, 2021

> What’s the point of balltree then?

@simsim314 I think it’s supposed to be better than KD trees when the dimension is “high” but not very high. For very high dimensions, I don’t know if an exact nearest neighbors algorithm can escape the curse of dimensionality and maintain logarithmic performance. I am aware that the documentation you quoted says “very high dimensions”, but that might be misleading.

On another note, that quote does say “on highly structured data” and this thread has so far been limited to random uniformly distributed data. The next paragraph in the documentation reiterates:

> it can out-perform a KD-tree in high dimensions, though the actual performance is highly dependent on the structure of the training data

I’m still not convinced it’s a bug. When you think about how balltree works, it makes sense. For example, suppose we have a query point q and a left and right node, N1 and N2. Then the nodes determine intervals (l1, u1) and (l2, u2) such that the distance |q-x| must belong to the corresponding interval for all x in that node. Now when the dimension is very high, there is very likely some overlap between the intervals (l1, u1) and (l2, u2), due to the concentration of distance phenomenon. That means neither node can be ruled out, so both N1 and N2 have to be searched.
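The concentration of distance mentioned above can be illustrated with a small experiment. This is a sketch under stated assumptions, not part of scikit-learn: the function `relative_contrast` and its parameters are hypothetical, the data is random uniform as in the benchmark, and only the standard library is used (`math.dist` needs Python 3.8+). It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=2000, seed=0):
    """Return (farthest - nearest) / nearest for one random query point.

    Hypothetical helper: distances are measured from a single uniform random
    query to n_points uniform random points in the unit cube of dimension dim.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


low = relative_contrast(dim=2)
high = relative_contrast(dim=512)
print(f"dim=2:   contrast = {low:.1f}")
print(f"dim=512: contrast = {high:.2f}")
```

In 2 dimensions the farthest point is many times farther away than the nearest one, so distance bounds can separate nodes. In 512 dimensions all distances fall in a narrow band, so the intervals of sibling nodes almost always overlap and the search has little to prune.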

1 reaction
simsim314 commented, Jan 28, 2021

No. It’s similar but different. He is talking about tree load time, and my report is about query time. His issue is also reproducible only in a single data case, while my issue is reproducible in the random case, i.e. most cases.
