memory leak in `k_means` for 0.23.dev0
Running `k_means` leaves memory in use even after all references to its inputs and outputs have been deleted.
Steps/Code to Reproduce
Save the following snippet in a script `issue_memory.py` and run it with `python3 -i issue_memory.py`.
from sklearn.cluster import k_means
import numpy as np

n_samples = 1000
n_features = 100000
data = np.random.normal(size=[n_samples, n_features])  # some Gaussian random noise
_, part, _ = k_means(
    data,
    n_clusters=10,
    init="k-means++",
    max_iter=30,
    n_init=10,
    random_state=1,
)
del part
del data
Expected Results
The script transiently uses a few GB of memory, and after it completes (before exiting the session) memory usage returns precisely to where it was prior to running the script. Exiting the session does not change the memory level. This is what happens with sklearn 0.22.
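A quick way to check this expectation is to log resident memory around the call. The sketch below is not part of the original report: it assumes a Linux system (as in the environment listed under Versions), and the `rss_mb` helper is a hypothetical, dependency-free reader of VmRSS from /proc/self/status.

import gc
import numpy as np
from sklearn.cluster import k_means

def rss_mb():
    # Resident set size (VmRSS) of the current process, in MB. Linux only.
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024

baseline = rss_mb()
data = np.random.normal(size=[1000, 100000])
_, part, _ = k_means(data, n_clusters=10, init="k-means++",
                     max_iter=30, n_init=10, random_state=1)
del part, data
gc.collect()
print("baseline: %.0f MB, after cleanup: %.0f MB" % (baseline, rss_mb()))

With sklearn 0.22 the two printed values are expected to be close; a large gap reproduces the problem described below.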
Actual Results
Using the current tip of master (version 0.23.dev0), even after deleting all variables, about 500 MB of memory remains in use in the session. I am working on ensemble clustering, and after many iterations this memory leak eventually fills all available RAM. Exiting the session releases the memory.
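To make the accumulation visible, the same measurement can be wrapped around repeated calls. Again a sketch, not from the original report: it reuses the Linux-only VmRSS idea above (redefined here so the snippet stands alone) and uses smaller data so the loop finishes quickly. If the leak is present, the printed RSS grows with each iteration; on 0.22 it should stay roughly flat.

import numpy as np
from sklearn.cluster import k_means

def rss_mb():
    # Resident set size (VmRSS) of the current process, in MB. Linux only.
    with open("/proc/self/status") as f:
        return next(int(l.split()[1]) / 1024 for l in f if l.startswith("VmRSS:"))

for i in range(5):
    data = np.random.normal(size=[1000, 20000])
    k_means(data, n_clusters=10, n_init=10, random_state=1)  # results discarded on purpose
    del data
    print("iteration %d: RSS = %.0f MB" % (i, rss_mb()))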
Versions
System:
python: 3.7.5rc1 (default, Oct 8 2019, 16:47:45) [GCC 9.2.1 20191008]
executable: /home/pbellec/env/dypac/bin/python3
machine: Linux-5.3.0-46-generic-x86_64-with-Ubuntu-19.10-eoan
Python dependencies:
pip: 20.0.2
setuptools: 45.0.0
sklearn: 0.23.dev0
numpy: 1.18.1
scipy: 1.4.1
Cython: None
pandas: 0.25.3
matplotlib: 3.1.2
joblib: 0.14.1
Built with OpenMP: True
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks for the report @pbellec. It’s really cool to have people testing the master branch so that we can fix bugs before they are released 😃 !
Very impressed by the speed at which this issue was solved. Huge thanks for this @jeremiedbb @jnothman and all the work that has gone into this new `k_means` version; it is going to be a big help in my research.