KMeans and memory overflowing
Description
I am wondering whether clustering 250000 samples into 6000 clusters with KMeans is simply too hard a problem to compute, because it kills even a server with 12 cores, 258 GB RAM and 60 GB of swap (a rough memory estimate is sketched after the links below).
Similar “questions”:
- python memory error for kmeans in scikit-learn
- Memory Error when fitting the data using sklearn package
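A plausible reading of the numbers, assuming the Elkan variant used here keeps a dense float64 lower-bounds array of shape (n_samples, n_clusters) per run (an assumption about the implementation, not a measurement): each of the n_init=10 runs dispatched in parallel via n_jobs=20 would then hold roughly 11 GB of bounds alone, plus distance buffers of comparable size, which together can exhaust even 258 GB of RAM. A back-of-envelope sketch:

# Back-of-envelope estimate; assumes the Elkan k-means path keeps a dense
# float64 lower-bounds array of shape (n_samples, n_clusters) per run.
n_samples = 250000
n_clusters = 6000
bytes_per_float = 8

per_run_gb = n_samples * n_clusters * bytes_per_float / 1024.0 ** 3
print("lower bounds per run: %.1f GB" % per_run_gb)         # ~11.2 GB

# n_init=10 runs are dispatched at once (n_jobs=20), so up to 10 copies
# can be alive simultaneously, before counting distance buffers and centers.
print("10 concurrent runs:   %.0f GB" % (10 * per_run_gb))  # ~112 GB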
Code to Reproduce
The use case is the following:
import numpy as np
from sklearn import cluster

locations = np.random.random((250000, 2)) * 5
kmean = cluster.KMeans(n_clusters=6000, n_init=10, max_iter=150,
                       verbose=True, n_jobs=20, copy_x=False,
                       precompute_distances=False)
kmean.fit(locations)
print(kmean.cluster_centers_)
Actual Results
Iteration 35, inertia 156.384475435
center shift 7.768886e-03 within tolerance 2.084699e-04
Traceback (most recent call last):
File "test_kmeans.py", line 8, in <module>
kmean.fit(locations)
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 889, in fit
return_n_iter=True)
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 362, in k_means
for seed in seeds)
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 768, in __call__
self.retrieve()
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 719, in retrieve
raise exception
sklearn.externals.joblib.my_exceptions.JoblibMemoryError: JoblibMemoryError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/mnt/datagrid/personal/borovec/Dropbox/Workspace/Uplus_fraud-monitoring/test_kmeans.py in <module>()
3
4 locations = np.random.random((250000, 2)) * 5
5 kmean = cluster.KMeans(n_clusters=6000, n_init=10, max_iter=150,
6 verbose=True, n_jobs=20, copy_x=False,
7 precompute_distances=False)
----> 8 kmean.fit(locations)
9 print (kmean.cluster_centers_)
10
11
12
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py in fit(self=KMeans(algorithm='auto', copy_x=False, init='k-m...
random_state=None, tol=0.0001, verbose=True), X=array([[-1.86344999, 0.05621132],
[ 0.88...-1.20243728],
[ 0.97877704, 1.24561138]]), y=None)
884 X, n_clusters=self.n_clusters, init=self.init,
885 n_init=self.n_init, max_iter=self.max_iter, verbose=self.verbose,
886 precompute_distances=self.precompute_distances,
887 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
888 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 889 return_n_iter=True)
890 return self
891
892 def fit_predict(self, X, y=None):
893 """Compute cluster centers and predict cluster index for each sample.
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py in k_means(X=array([[-1.86344999, 0.05621132],
[ 0.88...-1.20243728],
[ 0.97877704, 1.24561138]]), n_clusters=6000, init='k-means++', precompute_distances=False, n_init=10, max_iter=150, verbose=True, tol=0.00020846993669604294, random_state=<mtrand.RandomState object>, copy_x=False, n_jobs=20, algorithm='elkan', return_n_iter=True)
357 verbose=verbose, tol=tol,
358 precompute_distances=precompute_distances,
359 x_squared_norms=x_squared_norms,
360 # Change seed to ensure variety
361 random_state=seed)
--> 362 for seed in seeds)
seeds = array([ 968587040, 226617041, 2063896048, 6552... 393005117, 134324550, 14152465, 2054736812])
363 # Get results with the lowest inertia
364 labels, inertia, centers, n_iters = zip(*results)
365 best = np.argmin(inertia)
366 best_labels = labels[best]
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=20), iterable=<generator object <genexpr>>)
763 if pre_dispatch == "all" or n_jobs == 1:
764 # The iterable was consumed all at once by the above for loop.
765 # No need to wait for async callbacks to trigger to
766 # consumption.
767 self._iterating = False
--> 768 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=20)>
769 # Make sure that we get a last message telling us we are done
770 elapsed_time = time.time() - self._start_time
771 self._print('Done %3i out of %3i | elapsed: %s finished',
772 (len(self._output), len(self._output),
---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
MemoryError Tue Oct 17 16:11:14 2017
PID: 18062 Python 2.7.9: /mnt/home.dokt/borovji3/vEnv/bin/python
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func = <function _kmeans_single_elkan>
args = (memmap([[-1.86344999, 0.05621132],
[ 0....1.20243728],
[ 0.97877704, 1.24561138]]), 6000)
kwargs = {'init': 'k-means++', 'max_iter': 150, 'precompute_distances': False, 'random_state': 134324550, 'tol': 0.00020846993669604294, 'verbose': True, 'x_squared_norms': memmap([ 3.47560557, 1.66662896, 0.19072331, ..., 1.87283488,
3.21604332, 2.50955219])}
self.items = [(<function _kmeans_single_elkan>, (memmap([[-1.86344999, 0.05621132],
[ 0....1.20243728],
[ 0.97877704, 1.24561138]]), 6000), {'init': 'k-means++', 'max_iter': 150, 'precompute_distances': False, 'random_state': 134324550, 'tol': 0.00020846993669604294, 'verbose': True, 'x_squared_norms': memmap([ 3.47560557, 1.66662896, 0.19072331, ..., 1.87283488,
3.21604332, 2.50955219])})]
132
133 def __len__(self):
134 return self._size
135
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py in _kmeans_single_elkan(X=array([[-1.86344999, 0.05621132],
[ 0.88...-1.20243728],
[ 0.97877704, 1.24561138]]), n_clusters=6000, max_iter=150, init='k-means++', verbose=True, x_squared_norms=memmap([ 3.47560557, 1.66662896, 0.19072331, ..., 1.87283488,
3.21604332, 2.50955219]), random_state=<mtrand.RandomState object>, tol=0.00020846993669604294, precompute_distances=False)
394 x_squared_norms=x_squared_norms)
395 centers = np.ascontiguousarray(centers)
396 if verbose:
397 print('Initialization complete')
398 centers, labels, n_iter = k_means_elkan(X, n_clusters, centers, tol=tol,
--> 399 max_iter=max_iter, verbose=verbose)
max_iter = 150
verbose = True
400 inertia = np.sum((X - centers[labels]) ** 2, dtype=np.float64)
401 return labels, inertia, centers, n_iter
402
403
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/_k_means_elkan.so in sklearn.cluster._k_means_elkan.k_means_elkan (sklearn/cluster/_k_means_elkan.c:6961)()
225
226
227
228
229
--> 230
231
232
233
234
MemoryError:
___________________________________________________________________________
Versions
Python 2.7.9 (default, Mar 1 2015, 12:57:24) [GCC 4.9.2] on linux2; numpy==1.13.1, scipy==0.19.1, scikit-learn==0.18.1
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
try
algorithm="full"
I confirm that with #11950 I can run your script on my laptop without a memory error.
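Following the suggestion above, a minimal sketch of the workaround: the reporter's snippet with algorithm="full" added, which forces the classic Lloyd iteration instead of the Elkan path and its per-run bounds arrays. This is only a sketch under that assumption; memory use still grows with n_clusters and n_jobs, so it is not guaranteed to fit on every machine.

import numpy as np
from sklearn import cluster

locations = np.random.random((250000, 2)) * 5

# Same configuration as in the report, but forcing Lloyd's algorithm
# ("full") instead of the default, as suggested in the comments above.
kmean = cluster.KMeans(n_clusters=6000, n_init=10, max_iter=150,
                       verbose=True, n_jobs=20, copy_x=False,
                       precompute_distances=False,
                       algorithm="full")
kmean.fit(locations)
print(kmean.cluster_centers_)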