
DBSCAN running out of memory and getting killed

See original GitHub issue

Describe the bug

I am passing data normalized using MinMaxScaler to DBSCAN’s fit_predict. My data is very small (12 MB, around 180,000 rows and 9 columns). However, while running this, memory usage quickly climbs and the kernel gets killed (I presume by the OOM killer). I even tried it on a server with 256 GB RAM and it fails fairly quickly.

Here is my repro code:

import pandas as pd
from sklearn.cluster import DBSCAN

X_ml = pd.read_csv('Xml.csv')
dbscan = DBSCAN(eps=0.28, min_samples=9)
dbscan_pred = dbscan.fit_predict(X_ml)

and here is my Xml.csv data file.

Any ideas how to get it working?

Steps/Code to Reproduce

import pandas as pd
from sklearn.cluster import DBSCAN

X_ml = pd.read_csv('Xml.csv')
dbscan = DBSCAN(eps=0.28, min_samples=9)
dbscan_pred = dbscan.fit_predict(X_ml)

and here is my Xml.csv data file.

Expected Results

DBSCAN should complete.

Actual Results

It runs out of memory and gets killed.

Versions

System:
    python: 3.9.9 (main, Nov 21 2021, 03:23:42)  [Clang 13.0.0 (clang-1300.0.29.3)]
executable: /usr/local/opt/python@3.9/bin/python3.9
   machine: macOS-12.2.1-x86_64-i386-64bit

Python dependencies:
          pip: 21.3.1
   setuptools: 59.0.1
      sklearn: 1.0.2
        numpy: 1.22.0
        scipy: 1.7.3
       Cython: None
       pandas: 1.3.5
   matplotlib: 3.5.1
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

4 reactions
jeremiedbb commented, Feb 18, 2022

DBSCAN has a worst-case memory complexity of O(n²), which for 180,000 samples corresponds to a little more than 259 GB. This worst case can happen when eps is too large or min_samples too low, ending with all points in the same cluster.
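For reference, the 259 GB figure can be reproduced with a quick back-of-the-envelope calculation, assuming a dense n × n structure of 8-byte entries in the worst case:

```python
# Worst-case pairwise memory for n samples, assuming 8-byte (float64/int64) entries.
n = 180_000
worst_case_bytes = n * n * 8
print(f"{worst_case_bytes / 1e9:.1f} GB")  # roughly 259 GB
```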

However, that does not seem to be the only issue here. Your dataset contains a lot of duplicates. These also increase memory consumption, because all of the duplicates are in each other’s neighborhoods regardless of eps.
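A quick way to check how many duplicate rows a dataset contains (the small frame here is a hypothetical stand-in for the real Xml.csv):

```python
import pandas as pd

# Hypothetical stand-in for the real Xml.csv contents.
X = pd.DataFrame({"a": [0.1, 0.1, 0.2], "b": [0.5, 0.5, 0.9]})

# Counts rows that are identical to an earlier row.
n_duplicates = X.duplicated().sum()
```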

I suggest that you first remove the duplicates from your dataset. You can use sample weights if you still want to account for the fact that some points were duplicated: a point with a sample weight of n is equivalent to having n copies of that point.
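A minimal sketch of that deduplication approach (the tiny array and the eps/min_samples values are illustrative stand-ins; scikit-learn’s DBSCAN accepts a sample_weight argument in fit/fit_predict):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy stand-in for the real data: three duplicates of one point, plus one other.
X = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])

# Collapse duplicate rows; `counts` records each unique row's multiplicity,
# and `inverse` maps each original row back to its unique representative.
X_unique, inverse, counts = np.unique(
    X, axis=0, return_inverse=True, return_counts=True
)

# Weighting each unique point by its multiplicity is equivalent to
# clustering the original data with the copies kept.
db = DBSCAN(eps=0.3, min_samples=2)
labels_unique = db.fit_predict(X_unique, sample_weight=counts)

# Broadcast the labels back to the original rows.
labels_full = labels_unique[np.ravel(inverse)]
```

Here the triplicated point is a core point on its own (its weighted neighborhood is 3 ≥ min_samples), while the lone point is labeled noise.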

0 reactions
jeremiedbb commented, Feb 22, 2022

