KMeans and memory overflowing
Description
I am wondering whether clustering 250000 samples into 6000 clusters with KMeans is simply too hard a problem to compute, because it kills even a server with 12 cores, 258 GB RAM and 60 GB of swap (a rough memory estimate is sketched after the links below).
Similar “questions”:
- python memory error for kmeans in scikit-learn
- Memory Error when fitting the data using sklearn package
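A plausible reading of the numbers, assuming the Elkan variant used here keeps a dense float64 lower-bounds array of shape (n_samples, n_clusters) per run (an assumption about the implementation, not a measurement): each of the n_init=10 runs dispatched in parallel via n_jobs=20 would then hold roughly 11 GB of bounds alone, plus distance buffers of comparable size, which together can exhaust even 258 GB of RAM. A back-of-envelope sketch:

# Back-of-envelope estimate; assumes the Elkan k-means path keeps a dense
# float64 lower-bounds array of shape (n_samples, n_clusters) per run.
n_samples = 250000
n_clusters = 6000
bytes_per_float = 8

per_run_gb = n_samples * n_clusters * bytes_per_float / 1024.0 ** 3
print("lower bounds per run: %.1f GB" % per_run_gb)         # ~11.2 GB

# n_init=10 runs are dispatched at once (n_jobs=20), so up to 10 copies
# can be alive simultaneously, before counting distance buffers and centers.
print("10 concurrent runs:   %.0f GB" % (10 * per_run_gb))  # ~112 GB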
Code to Reproduce
The use case is the following:
import numpy as np
from sklearn import cluster

locations = np.random.random((250000, 2)) * 5
kmean = cluster.KMeans(n_clusters=6000, n_init=10, max_iter=150,
                       verbose=True, n_jobs=20, copy_x=False,
                       precompute_distances=False)
kmean.fit(locations)
print(kmean.cluster_centers_)
Actual Results
Iteration 35, inertia 156.384475435
center shift 7.768886e-03 within tolerance 2.084699e-04
Traceback (most recent call last):
File "test_kmeans.py", line 8, in <module>
kmean.fit(locations)
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 889, in fit
return_n_iter=True)
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 362, in k_means
for seed in seeds)
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 768, in __call__
self.retrieve()
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 719, in retrieve
raise exception
sklearn.externals.joblib.my_exceptions.JoblibMemoryError: JoblibMemoryError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/mnt/datagrid/personal/borovec/Dropbox/Workspace/Uplus_fraud-monitoring/test_kmeans.py in <module>()
3
4 locations = np.random.random((250000, 2)) * 5
5 kmean = cluster.KMeans(n_clusters=6000, n_init=10, max_iter=150,
6 verbose=True, n_jobs=20, copy_x=False,
7 precompute_distances=False)
----> 8 kmean.fit(locations)
9 print (kmean.cluster_centers_)
10
11
12
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py in fit(self=KMeans(algorithm='auto', copy_x=False, init='k-m...
random_state=None, tol=0.0001, verbose=True), X=array([[-1.86344999, 0.05621132],
[ 0.88...-1.20243728],
[ 0.97877704, 1.24561138]]), y=None)
884 X, n_clusters=self.n_clusters, init=self.init,
885 n_init=self.n_init, max_iter=self.max_iter, verbose=self.verbose,
886 precompute_distances=self.precompute_distances,
887 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
888 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 889 return_n_iter=True)
890 return self
891
892 def fit_predict(self, X, y=None):
893 """Compute cluster centers and predict cluster index for each sample.
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py in k_means(X=array([[-1.86344999, 0.05621132],
[ 0.88...-1.20243728],
[ 0.97877704, 1.24561138]]), n_clusters=6000, init='k-means++', precompute_distances=False, n_init=10, max_iter=150, verbose=True, tol=0.00020846993669604294, random_state=<mtrand.RandomState object>, copy_x=False, n_jobs=20, algorithm='elkan', return_n_iter=True)
357 verbose=verbose, tol=tol,
358 precompute_distances=precompute_distances,
359 x_squared_norms=x_squared_norms,
360 # Change seed to ensure variety
361 random_state=seed)
--> 362 for seed in seeds)
seeds = array([ 968587040, 226617041, 2063896048, 6552... 393005117, 134324550, 14152465, 2054736812])
363 # Get results with the lowest inertia
364 labels, inertia, centers, n_iters = zip(*results)
365 best = np.argmin(inertia)
366 best_labels = labels[best]
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=20), iterable=<generator object <genexpr>>)
763 if pre_dispatch == "all" or n_jobs == 1:
764 # The iterable was consumed all at once by the above for loop.
765 # No need to wait for async callbacks to trigger to
766 # consumption.
767 self._iterating = False
--> 768 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=20)>
769 # Make sure that we get a last message telling us we are done
770 elapsed_time = time.time() - self._start_time
771 self._print('Done %3i out of %3i | elapsed: %s finished',
772 (len(self._output), len(self._output),
---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
MemoryError Tue Oct 17 16:11:14 2017
PID: 18062 Python 2.7.9: /mnt/home.dokt/borovji3/vEnv/bin/python
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func = <function _kmeans_single_elkan>
args = (memmap([[-1.86344999, 0.05621132],
[ 0....1.20243728],
[ 0.97877704, 1.24561138]]), 6000)
kwargs = {'init': 'k-means++', 'max_iter': 150, 'precompute_distances': False, 'random_state': 134324550, 'tol': 0.00020846993669604294, 'verbose': True, 'x_squared_norms': memmap([ 3.47560557, 1.66662896, 0.19072331, ..., 1.87283488,
3.21604332, 2.50955219])}
self.items = [(<function _kmeans_single_elkan>, (memmap([[-1.86344999, 0.05621132],
[ 0....1.20243728],
[ 0.97877704, 1.24561138]]), 6000), {'init': 'k-means++', 'max_iter': 150, 'precompute_distances': False, 'random_state': 134324550, 'tol': 0.00020846993669604294, 'verbose': True, 'x_squared_norms': memmap([ 3.47560557, 1.66662896, 0.19072331, ..., 1.87283488,
3.21604332, 2.50955219])})]
132
133 def __len__(self):
134 return self._size
135
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py in _kmeans_single_elkan(X=array([[-1.86344999, 0.05621132],
[ 0.88...-1.20243728],
[ 0.97877704, 1.24561138]]), n_clusters=6000, max_iter=150, init='k-means++', verbose=True, x_squared_norms=memmap([ 3.47560557, 1.66662896, 0.19072331, ..., 1.87283488,
3.21604332, 2.50955219]), random_state=<mtrand.RandomState object>, tol=0.00020846993669604294, precompute_distances=False)
394 x_squared_norms=x_squared_norms)
395 centers = np.ascontiguousarray(centers)
396 if verbose:
397 print('Initialization complete')
398 centers, labels, n_iter = k_means_elkan(X, n_clusters, centers, tol=tol,
--> 399 max_iter=max_iter, verbose=verbose)
max_iter = 150
verbose = True
400 inertia = np.sum((X - centers[labels]) ** 2, dtype=np.float64)
401 return labels, inertia, centers, n_iter
402
403
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/_k_means_elkan.so in sklearn.cluster._k_means_elkan.k_means_elkan (sklearn/cluster/_k_means_elkan.c:6961)()
225
226
227
228
229
--> 230
231
232
233
234
MemoryError:
___________________________________________________________________________
Versions
Python 2.7.9 (default, Mar 1 2015, 12:57:24) [GCC 4.9.2] on linux2; numpy==1.13.1, scipy==0.19.1, scikit-learn==0.18.1
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
try
algorithm="full"
I confirm that with #11950 I can run your script on my laptop without a memory error.
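Following the suggestion above, a minimal sketch of the workaround: the reporter's snippet with algorithm="full" added, which forces the classic Lloyd iteration instead of the Elkan path and its per-run bounds arrays. This is only a sketch under that assumption; memory use still grows with n_clusters and n_jobs, so it is not guaranteed to fit on every machine.

import numpy as np
from sklearn import cluster

locations = np.random.random((250000, 2)) * 5

# Same configuration as in the report, but forcing Lloyd's algorithm
# ("full") instead of the default, as suggested in the comments above.
kmean = cluster.KMeans(n_clusters=6000, n_init=10, max_iter=150,
                       verbose=True, n_jobs=20, copy_x=False,
                       precompute_distances=False,
                       algorithm="full")
kmean.fit(locations)
print(kmean.cluster_centers_)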