memory leak in `k_means` for 0.23.dev0

Memory allocated by `k_means` is not released after the call returns; it stays in use until the Python session exits.

Steps/Code to Reproduce

Save the following snippet in a script named issue_memory.py and run it with python3 -i issue_memory.py.

from sklearn.cluster import k_means
import numpy as np

n_samples = 1000
n_features = 100000
data = np.random.normal(size=[n_samples, n_features])  # Some Gaussian random noise

_, part, _ = k_means(
    data,
    n_clusters=10,
    init="k-means++",
    max_iter=30,
    n_init=10,
    random_state=1,
)
del part
del data

Expected Results

The script transiently uses a few GB of memory. After it completes, and before the session exits, memory usage returns to exactly where it was before the script ran, so exiting the session does not change the memory level. This is what happens with sklearn 0.22.

Actual Results

Using the current tip of master (version 0.23.dev0), about 500 MB of memory is still in use in the session even after deleting all variables. I am working on ensemble clustering and, after many iterations, this memory leak eventually fills all available RAM. Exiting the session releases the memory.
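One way to put numbers on this behaviour is to read the process's resident set size (RSS) before and after repeated calls. The sketch below is not part of the original report: the helper names `current_rss_mb` and `rss_after_each_call` are hypothetical, and it assumes Linux (as in the report) because it reads /proc/self/statm.

```python
import gc
import os


def current_rss_mb():
    """Return this process's resident set size in MB (Linux only: reads /proc)."""
    with open("/proc/self/statm") as f:
        # Second field of statm is the number of resident pages.
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 1024 ** 2


def rss_after_each_call(fn, n_calls=5):
    """Call fn() n_calls times; return RSS (MB) before the first call and after each call.

    The return value of fn is discarded and gc.collect() runs before every
    reading, so growth is not explained by uncollected Python garbage alone.
    """
    gc.collect()
    readings = [current_rss_mb()]
    for _ in range(n_calls):
        fn()
        gc.collect()
        readings.append(current_rss_mb())
    return readings
```

With the `data` array from the script above, something like `rss_after_each_call(lambda: k_means(data, n_clusters=10, n_init=10, random_state=1))` would give one reading per call. Readings that keep climbing would match the behaviour described here, while small fluctuations can simply come from the allocator holding on to freed memory.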

Versions

System:
    python: 3.7.5rc1 (default, Oct  8 2019, 16:47:45)  [GCC 9.2.1 20191008]
executable: /home/pbellec/env/dypac/bin/python3
   machine: Linux-5.3.0-46-generic-x86_64-with-Ubuntu-19.10-eoan

Python dependencies:
       pip: 20.0.2
setuptools: 45.0.0
   sklearn: 0.23.dev0
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 0.25.3
matplotlib: 3.1.2
    joblib: 0.14.1

Built with OpenMP: True
Linux-5.3.0-46-generic-x86_64-with-Ubuntu-19.10-eoan
Python 3.7.5rc1 (default, Oct  8 2019, 16:47:45) 
[GCC 9.2.1 20191008]
NumPy 1.18.1
SciPy 1.4.1
Scikit-Learn 0.23.dev0
Traceback (most recent call last):
  File "version_sklearn.py", line 7, in <module>
    import imblearn; print("Imbalanced-Learn", imblearn.__version__)
ModuleNotFoundError: No module named 'imblearn'

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

2 reactions
jeremiedbb commented, Apr 22, 2020

Thanks for the report @pbellec. It’s really cool to have people testing the master branch so that we can fix bugs before they are released 😃 !

0 reactions
pbellec commented, Apr 22, 2020

Very impressed by the speed at which this issue was solved. Huge thanks for this, @jeremiedbb and @jnothman, and for all the work that has gone into this new k_means version; it is going to be a big help in my research.

Top Results From Across the Web

How to avoid memory leak when dealing with KMeans for ...
As the warning says: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than...

0.23 Milestone - GitHub
memory leak in k_means for 0.23.dev0 Blocker Bug. #16991 by pbellec was closed on Apr 22, 2020. 1 · 8 · [MRG] MNT...

Tackling high RAM usage from KMeans in Python - Medium
The solution works by dividing a single computing load into a smaller batch of clustering jobs.

Release History — scikit-learn 0.22.dev0 documentation
KMeans produced inconsistent results between n_jobs=1 and n_jobs>1 due to the ... now uses chunks of data at prediction step, thus capping the...

Simple Exploration Notebook - 2σ Connect | Kaggle
from sklearn.cluster import KMeans clusters = KMeans(n_clusters=5) ... in request(method, url, **kwargs) 54 # cases, and look like a memory leak in others....
