Nearest neighbors with trees perf decreased by debugging stats

Description

For the ball_tree and kd_tree algorithms, some statistics collected during tree queries substantially reduce the speedup gained from parallelization.

Those stats are:

n_trims: queried points outside the node radius
n_leaves: leaves reached while querying
n_splits: non-leaf nodes queried
n_calls: number of computed distances

These stats seem useful only for debugging, do not appear to be part of the official API (they are undocumented), and only 2 (personal) git repos use the method (get_tree_stats) to retrieve them.

Deactivating them substantially improves the performance of the associated algorithms when n_jobs > 1.

Benchmark

Test of kneighbors function with default parameters and:

samples dimension: 100
fit: 10k samples
kneighbors: 10k samples

(Also tested OpenMP prange parallelism, but it does not improve performance.)

=============
=== brute ===
=============
Joblib (loky) :
- n_jobs = 1 (MKL mono threaded) -> 2.6s
- n_jobs = 1 (MKL multi threaded, 40 threads) -> 1.9s
- n_jobs = 4  -> 4.0s
- n_jobs = 10 -> 3.5s
- n_jobs = 40 -> 3.5s

=================
=== ball_tree ===
=================
Joblib (loky) :
- n_jobs = 1  -> 10.9s
- n_jobs = 4  ->  7.7s
- n_jobs = 10 ->  6.8s
- n_jobs = 40 ->  3.8s

Joblib (loky) no stats:
- n_jobs = 1  -> 12.0s
- n_jobs = 4  ->  3.2s
- n_jobs = 10 ->  1.4s
- n_jobs = 40 ->  0.6s

OpenMP no stats:
- n_jobs = 4  ->  3.2s
- n_jobs = 10 ->  1.4s

===============
=== kd_tree ===
===============
Joblib (loky) :
- n_jobs = 1  -> 19.1s
- n_jobs = 4  ->  9.0s
- n_jobs = 10 ->  10.9s
- n_jobs = 40 ->  8.5s

Joblib (loky) no stats:
- n_jobs = 1  -> 19.0s
- n_jobs = 4  ->  5.1s
- n_jobs = 10 ->  2.2s
- n_jobs = 40 ->  1.0s

OpenMP no stats:
- n_jobs = 4  ->  5.1s
- n_jobs = 10 ->  2.2s

Top GitHub Comments

ogrisel commented, Apr 13, 2021

Without the stats, the performance scaling does indeed seem almost perfect (run time scales as 1/n_jobs), while it is much worse when they are enabled.
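The scaling claim can be checked against the ball_tree timings reported in the benchmark. The sketch below computes speedup (time at n_jobs = 1 divided by time at n_jobs = n) and parallel efficiency (speedup divided by n). The function name `scaling` and the constant names are made up for this illustration; the timings are copied from the issue.

```python
# Run times in seconds for ball_tree with Joblib (loky), copied from the
# benchmark in this issue.
N_JOBS = [1, 4, 10, 40]
BALL_TREE_WITH_STATS = [10.9, 7.7, 6.8, 3.8]
BALL_TREE_NO_STATS = [12.0, 3.2, 1.4, 0.6]


def scaling(n_jobs, times):
    """Return (n_jobs, speedup, efficiency) tuples.

    The first entry is used as the baseline, so it is assumed to be the
    n_jobs = 1 measurement.
    """
    baseline = times[0]
    rows = []
    for n, t in zip(n_jobs, times):
        speedup = baseline / t      # how many times faster than the baseline
        efficiency = speedup / n    # 1.0 would be ideal 1/n_jobs scaling
        rows.append((n, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for label, times in [("with stats", BALL_TREE_WITH_STATS),
                         ("no stats", BALL_TREE_NO_STATS)]:
        print(label)
        for n, speedup, efficiency in scaling(N_JOBS, times):
            print(f"  n_jobs={n:>2}  speedup={speedup:5.2f}  "
                  f"efficiency={efficiency:4.2f}")
```

With these numbers, 4 workers without stats give a speedup of 3.75 (12.0 / 3.2), while 40 workers with stats stay below 3x (10.9 / 3.8).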

One possible explanation is a typical case of false sharing: CPU cache invalidation caused by concurrent writes to contiguously allocated data structure fields that live in the same cache line.
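To make the hypothesis concrete, here is a minimal layout sketch using Python's ctypes. It only shows how four adjacent counters can end up in one cache line and how padding would spread them out; it does not reproduce the slowdown. The 32-bit field types, the 64-byte cache line size, and the struct names `PackedStats` and `PaddedStats` are assumptions for illustration, not scikit-learn's actual tree layout.

```python
import ctypes

CACHE_LINE_BYTES = 64  # assumption: common on x86-64, not stated in the issue
STAT_NAMES = ["n_trims", "n_leaves", "n_splits", "n_calls"]


class PackedStats(ctypes.Structure):
    # Hypothetical layout: the four counters stored back to back.
    _fields_ = [(name, ctypes.c_int32) for name in STAT_NAMES]


def _padded_fields(names):
    """Follow each 4-byte counter with padding up to one full cache line."""
    fields = []
    for i, name in enumerate(names):
        fields.append((name, ctypes.c_int32))
        fields.append((f"_pad{i}", ctypes.c_char * (CACHE_LINE_BYTES - 4)))
    return fields


class PaddedStats(ctypes.Structure):
    # Hypothetical alternative: one counter per cache line.
    _fields_ = _padded_fields(STAT_NAMES)


def cache_lines(struct_type, names):
    """Map each field to the index of the cache line it falls in.

    Assumes the struct itself starts on a cache-line boundary.
    """
    return {
        name: getattr(struct_type, name).offset // CACHE_LINE_BYTES
        for name in names
    }


if __name__ == "__main__":
    print("packed:", cache_lines(PackedStats, STAT_NAMES))
    print("padded:", cache_lines(PaddedStats, STAT_NAMES))
```

In the packed layout all four counters fall in cache line 0, so a write to any of them from one core invalidates the line holding the others. Padding separates different counters, but it would not help if several threads write to the same counter; that case needs per-thread counters or no counters at all.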

ogrisel commented, Apr 14, 2021

One way to check this hypothesis would be to use Linux perf or Cachegrind to collect cache invalidation statistics with and without #19884.
