
n_jobs for DBSCAN


#16213

#### Describe the bug

The n_jobs argument doesn’t seem to change the time it takes to run DBSCAN.fit(): runtime is roughly the same with and without it. Is n_jobs actually implemented for DBSCAN.fit()?

#### Steps/Code to Reproduce

Example:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from time import time
from sklearn.cluster import DBSCAN

# generate a symmetric distance matrix
num_training_examples = 30000
num_features = 10
X = np.random.randint(5, size=(num_training_examples, num_features))
D = euclidean_distances(X,X)

# DBSCAN parameters
eps = 0.25
kmedian_thresh = 0.005
min_samples = 5

# case 1: omit n_jobs arg from DBSCAN
start = time()
db = DBSCAN(eps=eps,
            min_samples = min_samples,
            metric='precomputed').fit(D)
end = time()
total_time = end - start
print('DBSCAN took {} seconds for {} training examples without n_jobs arg'\
       .format(total_time,num_training_examples))


# case 2: add n_jobs arg to DBSCAN
n_jobs = -1
start = time()
db = DBSCAN(eps=eps,
            min_samples = min_samples,
            metric='precomputed',
            n_jobs=n_jobs).fit(D)
end = time()
total_time = end - start
print('DBSCAN took {} seconds for {} training examples with n_jobs arg'\
       .format(total_time,num_training_examples))
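
For scale, the precomputed matrix D in this reproduction has 30000 x 30000 entries. A rough size estimate is sketched below, assuming 8-byte float64 entries (an assumption about the dtype that euclidean_distances returns here); the helper name dense_matrix_bytes is hypothetical and only for illustration.

```python
def dense_matrix_bytes(n_samples, bytes_per_entry=8):
    """Size in bytes of a dense n_samples x n_samples matrix.

    bytes_per_entry=8 assumes float64 entries (an assumption, not
    something stated in the report).
    """
    return n_samples * n_samples * bytes_per_entry


size = dense_matrix_bytes(30000)
print("{:.1f} GB".format(size / 1e9))  # prints 7.2 GB
```

A smaller num_training_examples makes the same comparison much cheaper to rerun while experimenting.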


#### Expected Results

Expected runtime to decrease with more processors.

#### Actual Results

Runtime basically unchanged.

DBSCAN took 285.76699996 seconds for 30000 training examples without n_jobs arg
DBSCAN took 363.289000034 seconds for 30000 training examples with n_jobs arg
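
Each figure above comes from a single run timed with time(), so part of the gap between the two cases may be run-to-run noise. A minimal sketch of a repeat-and-take-the-fastest timer is shown here, assuming Python 3 (the report used Python 2.7, where perf_counter is not available). The helper name best_of is hypothetical and not part of scikit-learn.

```python
from time import perf_counter


def best_of(fn, repeats=3):
    """Call fn() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        fn()  # the work being measured, e.g. a DBSCAN fit
        timings.append(perf_counter() - start)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(timings)
```

With the variables from the reproduction, each case could be wrapped as `best_of(lambda: DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed', n_jobs=-1).fit(D))` and compared against the same call without n_jobs.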

#### Versions
    Cython: None
     scipy: 1.2.2
setuptools: 41.6.0
       pip: 19.3.1
     numpy: 1.16.5
    pandas: 0.24.2
   sklearn: 0.20.4
import sys; print("Python", sys.version)
('Python', '2.7.17 (v2.7.17:c2f86d86e6, Oct 19 2019, 21:01:17) [MSC v.1500 64 bit (AMD64)]')

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
glemaitre commented, Jan 31, 2020

go for it

On Fri, 31 Jan 2020 at 15:45, JohanWork notifications@github.com wrote:

Hi! If it is okay, I would like to fix it and update the documentation.

– Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/

0 reactions
nilichen commented, Mar 11, 2020

This issue should be closed by #16615?


