BallTree query match time is O(n) not O(log(n))

I’ve run a performance analysis of nearest-neighbor matching with BallTree (the same happens with KDTree). The query time grows linearly with the number of elements, whereas it should be O(log(n)).

Here are the benchmark results:

num_elements  match_time (s)
10000         0.09097146987915039
20000         0.18194293975830078
40000         0.3668830394744873
80000         0.7527577877044678
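As a quick check on the "linear" reading of these numbers, the sketch below estimates the scaling exponent from the four timings quoted above. The helper names `scaling_exponent` and `doubling_ratios` are made up for illustration. An exponent near 1 would be consistent with O(n), while logarithmic growth would show doubling ratios only slightly above 1.

```python
import math

# Timings quoted in the issue: (num_elements, match_time in seconds).
timings = [
    (10000, 0.09097146987915039),
    (20000, 0.18194293975830078),
    (40000, 0.3668830394744873),
    (80000, 0.7527577877044678),
]


def scaling_exponent(samples):
    """Slope of log(time) against log(n) between the first and last sample."""
    (n0, t0), (n1, t1) = samples[0], samples[-1]
    return math.log(t1 / t0) / math.log(n1 / n0)


def doubling_ratios(samples):
    """Ratio of consecutive timings; n doubles between consecutive rows."""
    return [t_next / t_prev
            for (_, t_prev), (_, t_next) in zip(samples, samples[1:])]


exponent = scaling_exponent(timings)
ratios = doubling_ratios(timings)
print(f"estimated exponent: {exponent:.2f}")
print("time ratio per doubling of n:", [round(r, 2) for r in ratios])
```

Four points are not much to fit, so this only confirms that the posted numbers look linear over this range. It says nothing about behavior at other sizes or dimensions.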

Here is my code:


from sklearn.neighbors import BallTree
import numpy as np
import time


def tree_perf(tree_size):
    # Uniformly random data in 512 dimensions.
    X = np.random.rand(tree_size, 512)
    Y1 = np.random.rand(1, 512)   # a single query point
    Y2 = np.random.rand(10, 512)  # a batch of 10 query points

    # Time the tree construction.
    ts = time.perf_counter()
    tree = BallTree(X, leaf_size=30, metric='euclidean')
    load_tree = time.perf_counter() - ts

    num_nn = 1

    # Time a query with one point.
    ts = time.perf_counter()
    tree.query(Y1, k=num_nn, return_distance=True)
    match1 = time.perf_counter() - ts

    # Time a query with ten points.
    ts = time.perf_counter()
    tree.query(Y2, k=num_nn, return_distance=True)
    match10 = time.perf_counter() - ts

    print(tree_size, load_tree, match1, match10)


print("num_elements", "load_tree", "match_1", "match_10")

# Tree sizes from 10,000 to 1,000,000 in steps of 10,000.
for i in range(100):
    tree_size = 10000 + i * 10000
    tree_perf(tree_size)

Issue Analytics

Created 3 years ago
Comments: 18 (7 by maintainers)

Top GitHub Comments

1 reaction

gkaranikas commented, Feb 10, 2021

“What’s the point of balltree then?”

@simsim314 I think it’s supposed to be better than KD trees when the dimension is “high” but not very high. For very high dimensions, I don’t know if an exact nearest neighbors algorithm can escape the curse of dimensionality and maintain logarithmic performance. I am aware that the documentation you quoted says “very high dimensions”, but that might be misleading.

On another note, that quote does say “on highly structured data” and this thread has so far been limited to random uniformly distributed data. The next paragraph in the documentation reiterates:

“it can out-perform a KD-tree in high dimensions, though the actual performance is highly dependent on the structure of the training data”

I’m still not convinced it’s a bug. When you think about how balltree works, it makes sense. For example, suppose we have a query point q and a left and right node, N1 and N2. Then the nodes determine intervals (l1, u1) and (l2, u2) such that the distance |q-x| must lie in the corresponding interval for every x in that node. When the dimension is very high, the intervals (l1, u1) and (l2, u2) very likely overlap, due to the concentration-of-distances phenomenon, which means both nodes N1 and N2 have to be searched.
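To make the interval argument concrete, here is a rough pure-Python sketch. It is not scikit-learn's implementation. It assumes a simplified split at the median of the first coordinate and a centroid-centered bounding ball, and the helper names (`bounding_ball`, `node_intervals`, `overlaps`) are invented for illustration. For each node it derives (l, u) from the triangle inequality and checks whether the two intervals overlap for uniform random data in 512 dimensions.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bounding_ball(points):
    """Centroid and radius of a centroid-centered ball covering the points."""
    dim = len(points[0])
    centroid = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    radius = max(dist(centroid, p) for p in points)
    return centroid, radius


def node_intervals(dim, n_points=200, seed=0):
    """Distance intervals (l, u) for two sibling nodes and one random query."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    q = [rng.random() for _ in range(dim)]
    # Assumption: split at the median of the first coordinate. A real ball
    # tree chooses its split differently; this is only for illustration.
    points.sort(key=lambda p: p[0])
    half = n_points // 2
    intervals = []
    for node in (points[:half], points[half:]):
        centroid, radius = bounding_ball(node)
        d = dist(q, centroid)
        # Triangle inequality: d - radius <= |q - x| <= d + radius
        # for every x inside the node's ball.
        intervals.append((max(0.0, d - radius), d + radius))
    return intervals


def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


(l1, u1), (l2, u2) = node_intervals(dim=512)
print(f"N1: ({l1:.2f}, {u1:.2f})  N2: ({l2:.2f}, {u2:.2f})")
print("intervals overlap:", overlaps((l1, u1), (l2, u2)))
```

When the intervals overlap like this, the bounds alone cannot rule out either node, so a query has to descend into both. That is the mechanism by which query time can drift toward linear on this kind of data.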

1 reaction

simsim314 commented, Jan 28, 2021

No. It’s similar but different. He is talking about the tree load time, and my report is about the query time. His issue is also reproducible on a single data case, whereas mine is reproducible on random data, i.e. in most cases.