
n_jobs for DBSCAN


#16213

#### Describe the bug

The n_jobs argument doesn’t seem to reduce the time it takes to run DBSCAN.fit(): runtime is no better with n_jobs than without it. Is n_jobs actually implemented for DBSCAN.fit()?

#### Steps/Code to Reproduce

Example:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from time import time
from sklearn.cluster import DBSCAN

# generate a symmetric distance matrix
num_training_examples = 30000
num_features = 10
X = np.random.randint(5, size=(num_training_examples, num_features))
D = euclidean_distances(X,X)

# DBSCAN parameters
eps = 0.25
kmedian_thresh = 0.005
min_samples = 5

# case 1: omit n_jobs arg from DBSCAN
start = time()
db = DBSCAN(eps=eps,
            min_samples = min_samples,
            metric='precomputed').fit(D)
end = time()
total_time = end - start
print('DBSCAN took {} seconds for {} training examples without n_jobs arg'\
       .format(total_time,num_training_examples))


# case 2: add n_jobs arg to DBSCAN
n_jobs = -1
start = time()
db = DBSCAN(eps=eps,
            min_samples = min_samples,
            metric='precomputed',
            n_jobs=n_jobs).fit(D)
end = time()
total_time = end - start
# only two placeholders in the message, so pass exactly two values
print('DBSCAN took {} seconds for {} training examples with n_jobs arg'\
       .format(total_time,num_training_examples))



#### Expected Results

Expected runtime to decrease with more processors.

#### Actual Results

Runtime basically unchanged.

DBSCAN took 285.76699996 seconds for 30000 training examples without n_jobs arg
DBSCAN took 363.289000034 seconds for 30000 training examples with n_jobs arg
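The reproduction above passes a dense precomputed distance matrix, so all pairwise distances already exist before fit() is called. A smaller check along the following lines may help show whether n_jobs matters for a given kind of input. It is only a sketch under assumptions: the data size, eps value, and the helper name time_fit are hypothetical and not taken from the report, and timings will vary by machine and scikit-learn version.

```python
from time import perf_counter

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances


def time_fit(data, metric, n_jobs):
    """Fit DBSCAN once and return (elapsed seconds, labels). Hypothetical helper."""
    start = perf_counter()
    model = DBSCAN(eps=0.5, min_samples=5, metric=metric, n_jobs=n_jobs).fit(data)
    return perf_counter() - start, model.labels_


rng = np.random.RandomState(0)
X = rng.rand(1000, 10)         # assumed small data set, far smaller than the report's
D = euclidean_distances(X, X)  # dense precomputed matrix, as in the report

results = {}
for metric, data in (("precomputed", D), ("euclidean", X)):
    for n_jobs in (None, -1):
        elapsed, labels = time_fit(data, metric, n_jobs)
        results[(metric, n_jobs)] = labels
        print("metric={!r:>13} n_jobs={!s:>4}: {:.3f} s".format(metric, n_jobs, elapsed))
```

For a fixed input, the labels should not depend on n_jobs; only the elapsed time can differ, and on small inputs the overhead of starting workers may outweigh any gain.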

#### Versions
    Cython: None
     scipy: 1.2.2
setuptools: 41.6.0
       pip: 19.3.1
     numpy: 1.16.5
    pandas: 0.24.2
   sklearn: 0.20.4
import sys; print("Python", sys.version)
('Python', '2.7.17 (v2.7.17:c2f86d86e6, Oct 19 2019, 21:01:17) [MSC v.1500 64 bit (AMD64)]')

Issue Analytics

Created 4 years ago
Comments: 8 (6 by maintainers)

Top GitHub Comments


glemaitre commented, Jan 31, 2020

go for it

On Fri, 31 Jan 2020 at 15:45, JohanWork wrote:

Hi! If it is okay, I would like to fix it and update the documentation.


– Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/


nilichen commented, Mar 11, 2020

Should this issue be closed by #16615?
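For readers who need the neighbor search itself to run in parallel, one option is to build a sparse radius-neighbors graph with NearestNeighbors (which accepts n_jobs) and hand that graph to DBSCAN with metric='precomputed'. The sketch below is an illustration, not the fix discussed in this thread; the data and parameter values are assumptions, and it requires the original feature array rather than only a dense distance matrix.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

eps = 0.5          # assumed value for illustration
min_samples = 5

rng = np.random.RandomState(0)
X = rng.rand(1000, 10)  # hypothetical feature array

# Neighbor search, parallelised by NearestNeighbors via n_jobs.
nn = NearestNeighbors(radius=eps, n_jobs=-1).fit(X)

# Sparse matrix that stores distances only for pairs within the radius.
graph = nn.radius_neighbors_graph(X, mode="distance")

# DBSCAN on the sparse precomputed graph versus DBSCAN on the raw features.
labels_sparse = DBSCAN(eps=eps, min_samples=min_samples,
                       metric="precomputed").fit(graph).labels_
labels_direct = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_

print("same labels:", np.array_equal(labels_sparse, labels_direct))
print("stored distances:", graph.nnz, "of", X.shape[0] ** 2)
```

The sparse graph also avoids holding the full n-by-n distance matrix in memory, which can matter at sizes like the 30000 samples used in the report.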
