DBSCAN running out of memory and getting killed
Describe the bug
I am passing data normalized with MinMaxScaler to DBSCAN's fit_predict. My data is very small (12 MB, around 180,000 rows and 9 columns). However, while running this, memory usage quickly climbs and the kernel gets killed (I presume by the OOM killer). I even tried it on a server with 256 GB of RAM and it fails fairly quickly.
Any ideas how to get it working?
Steps/Code to Reproduce
import pandas as pd
from sklearn.cluster import DBSCAN

X_ml = pd.read_csv('Xml.csv')
dbscan = DBSCAN(eps=0.28, min_samples=9)
dbscan_pred = dbscan.fit_predict(X_ml)
and here is my Xml.csv data file.
Expected Results
DBSCAN should complete.
Actual Results
It runs out of memory and gets killed.
Versions
System:
python: 3.9.9 (main, Nov 21 2021, 03:23:42) [Clang 13.0.0 (clang-1300.0.29.3)]
executable: /usr/local/opt/python@3.9/bin/python3.9
machine: macOS-12.2.1-x86_64-i386-64bit
Python dependencies:
pip: 21.3.1
setuptools: 59.0.1
sklearn: 1.0.2
numpy: 1.22.0
scipy: 1.7.3
Cython: None
pandas: 1.3.5
matplotlib: 3.5.1
joblib: 1.1.0
threadpoolctl: 3.1.0
Built with OpenMP: True
Top GitHub Comments
DBSCAN has a worst-case memory complexity of O(n^2), which for 180,000 samples corresponds to a little more than 259 GB. This worst case can happen if eps is too large or min_samples too low, ending with all points being in the same cluster.
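As a rough sanity check on that figure (assuming the worst case means materializing a dense float64 matrix of all pairwise distances):

n = 180_000
worst_case_bytes = n * n * 8   # one float64 distance per ordered pair of points
print(worst_case_bytes / 1e9)  # ~259.2, i.e. a little over 259 GB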
However, that does not seem to be the only issue here. Your dataset contains a lot of duplicates. This also increases memory consumption, because all these duplicates are in each other's neighborhood regardless of eps.
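A quick way to confirm how much duplication there is, as a small pandas sketch assuming the Xml.csv from the repro:

import pandas as pd

X_ml = pd.read_csv('Xml.csv')
n_dup = X_ml.duplicated().sum()  # rows that are exact copies of an earlier row
print(f'{n_dup} duplicate rows out of {len(X_ml)}')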
I suggest that you first remove the duplicates from your dataset. You can use sample weights if you still want to account for the fact that some points were duplicated: a point with a sample weight of n is equivalent to having n copies of that point.
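A minimal sketch of that suggestion (deduplicate with pandas, then pass the multiplicities through sample_weight; the groupby assumes plain numeric feature columns without NaNs):

import pandas as pd
from sklearn.cluster import DBSCAN

X_ml = pd.read_csv('Xml.csv')

# Collapse exact duplicates and record how many times each unique row occurred.
counts = X_ml.groupby(list(X_ml.columns), as_index=False).size()
X_unique = counts.drop(columns='size')
weights = counts['size'].to_numpy()

dbscan = DBSCAN(eps=0.28, min_samples=9)
labels_unique = dbscan.fit_predict(X_unique, sample_weight=weights)

Labels for the original rows can then be recovered by merging labels_unique back onto X_ml on the feature columns.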
That’s the eps parameter. See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
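The linked documentation also mentions a memory-saving route for large sample sizes: precompute a sparse radius-neighborhood graph and pass it with metric='precomputed'. A sketch of that idea, reusing eps and min_samples from the repro (with heavy duplication the graph itself can still get very large, so removing duplicates first remains the main fix):

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X_ml = pd.read_csv('Xml.csv')
eps = 0.28

# Store only the distances that are <= eps, instead of the dense n x n matrix.
graph = NearestNeighbors(radius=eps).fit(X_ml).radius_neighbors_graph(X_ml, mode='distance')

dbscan = DBSCAN(eps=eps, min_samples=9, metric='precomputed')
labels = dbscan.fit_predict(graph)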