DBSCAN running out of memory and getting killed
Describe the bug
I am passing data normalized with MinMaxScaler to DBSCAN's fit_predict. My data is very small (12 MB, around 180,000 rows and 9 columns). However, while running this, memory usage quickly climbs and the kernel gets killed (I presume by the OOM killer). I even tried it on a server with 256 GB of RAM and it fails fairly quickly.
Any ideas how to get it working?
Steps/Code to Reproduce
import pandas as pd
from sklearn.cluster import DBSCAN

X_ml = pd.read_csv('Xml.csv')
dbscan = DBSCAN(eps=0.28, min_samples=9)
dbscan_pred = dbscan.fit_predict(X_ml)
and here is my Xml.csv data file.
Expected Results
DBSCAN should complete.
Actual Results
It runs out of memory and gets killed.
Versions
System:
python: 3.9.9 (main, Nov 21 2021, 03:23:42) [Clang 13.0.0 (clang-1300.0.29.3)]
executable: /usr/local/opt/python@3.9/bin/python3.9
machine: macOS-12.2.1-x86_64-i386-64bit
Python dependencies:
pip: 21.3.1
setuptools: 59.0.1
sklearn: 1.0.2
numpy: 1.22.0
scipy: 1.7.3
Cython: None
pandas: 1.3.5
matplotlib: 3.5.1
joblib: 1.1.0
threadpoolctl: 3.1.0
Built with OpenMP: True
Top GitHub Comments
DBSCAN has a worst-case memory complexity of O(n^2), which for 180,000 samples corresponds to a little more than 259 GB. This worst case can happen if eps is too large or min_samples too low, ending with all points being in the same cluster.
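As a rough sanity check on that figure (assuming the worst case means materializing a dense float64 matrix of all pairwise distances):

n = 180_000
worst_case_bytes = n * n * 8   # one float64 distance per ordered pair of points
print(worst_case_bytes / 1e9)  # ~259.2, i.e. a little over 259 GB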
However, that does not seem to be the only issue here. Your dataset contains a lot of duplicates. This also increases memory consumption, because all these duplicates are in each other's neighborhood regardless of eps.
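A quick way to confirm how much duplication there is, as a small pandas sketch assuming the Xml.csv from the repro:

import pandas as pd

X_ml = pd.read_csv('Xml.csv')
n_dup = X_ml.duplicated().sum()  # rows that are exact copies of an earlier row
print(f'{n_dup} duplicate rows out of {len(X_ml)}')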
I suggest that you first remove the duplicates from your dataset. You can use sample weights if you still want to account for the fact that some points were duplicated: a point with a sample weight of n is equivalent to having n copies of that point.
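A minimal sketch of that suggestion (deduplicate with pandas, then pass the multiplicities through sample_weight; the groupby assumes plain numeric feature columns without NaNs):

import pandas as pd
from sklearn.cluster import DBSCAN

X_ml = pd.read_csv('Xml.csv')

# Collapse exact duplicates and record how many times each unique row occurred.
counts = X_ml.groupby(list(X_ml.columns), as_index=False).size()
X_unique = counts.drop(columns='size')
weights = counts['size'].to_numpy()

dbscan = DBSCAN(eps=0.28, min_samples=9)
labels_unique = dbscan.fit_predict(X_unique, sample_weight=weights)

Labels for the original rows can then be recovered by merging labels_unique back onto X_ml on the feature columns.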
That’s the eps parameter. See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
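The linked documentation also mentions a memory-saving route for large sample sizes: precompute a sparse radius-neighborhood graph and pass it with metric='precomputed'. A sketch of that idea, reusing eps and min_samples from the repro (with heavy duplication the graph itself can still get very large, so removing duplicates first remains the main fix):

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X_ml = pd.read_csv('Xml.csv')
eps = 0.28

# Store only the distances that are <= eps, instead of the dense n x n matrix.
graph = NearestNeighbors(radius=eps).fit(X_ml).radius_neighbors_graph(X_ml, mode='distance')

dbscan = DBSCAN(eps=eps, min_samples=9, metric='precomputed')
labels = dbscan.fit_predict(graph)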