HistGradientBoostingClassifier causing memory leak in Ubuntu
To select an appropriate model, I'm running k-fold cross-validation across 11 models. HistGradientBoostingClassifier causes a memory leak and freezes my machine 😿
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
models = [
    #DecisionTreeClassifier(max_depth=16),
    #MLPClassifier(alpha=1, max_iter=1000),
    #AdaBoostClassifier(),
    #KNeighborsClassifier(3),
    #GaussianNB(),
    HistGradientBoostingClassifier(random_state=0, min_samples_leaf=20, loss='categorical_crossentropy'),
    #GradientBoostingClassifier(random_state=0),
    #MultinomialNB(),
    #LogisticRegression(random_state=0, solver='lbfgs', multi_class='auto'),
    #RandomForestClassifier(n_estimators=500, max_depth=32, random_state=0),
    #SVC(kernel='linear', probability=True, gamma='auto', random_state=0),
]
CV = 5
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
# features and labels are the feature matrix and target vector (not shown in this issue)
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        print(f'{model_name} - {fold_idx} {accuracy:.4f}')
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
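To put numbers on the memory growth described above, one option is to record the process's peak resident set size before and after the cross-validation call. The sketch below is only an illustration: `peak_rss_mib` and `run_and_report` are hypothetical helper names, and it assumes Linux, where `ru_maxrss` is reported in KiB. Because the value is a high-water mark, it shows how far the peak rose but not whether memory was released afterwards.

```python
import resource


def peak_rss_mib():
    """Peak resident set size of this process in MiB.

    Assumes Linux, where ru_maxrss is expressed in KiB.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def run_and_report(label, fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and print how much the peak RSS rose.

    Hypothetical helper: returns the function result and the growth in MiB.
    """
    before = peak_rss_mib()
    result = fn(*args, **kwargs)
    after = peak_rss_mib()
    print(f"{label}: peak RSS {before:.1f} -> {after:.1f} MiB")
    return result, after - before


# Small stand-in workload; in the script above, the call would instead be
# run_and_report(model_name, cross_val_score, model, features, labels,
#                scoring='accuracy', cv=CV)
result, growth = run_and_report("demo", lambda: sum(range(1000)))
```

Running this once per model would show whether HistGradientBoostingClassifier alone accounts for the growth.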
OS Description
Distributor ID: Ubuntu
Description: Ubuntu 19.10
Release: 19.10
Python Ecosystem Details
conda 4.8.1
Python 3.6.6 :: Anaconda, Inc.
Conda Environment Details
channels:
- anaconda
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- anaconda-client=1.7.2=py36_0
- anaconda-navigator=1.9.7=py36_0
- attrs=19.3.0=py_0
- backcall=0.1.0=py36_0
- blas=1.0=mkl
- bleach=3.1.0=py36_0
- ca-certificates=2020.1.1=0
- certifi=2019.11.28=py36_0
- chardet=3.0.4=py36_1003
- clyent=1.2.2=py36_1
- cycler=0.10.0=py36_0
- dbus=1.13.12=h746ee38_0
- decorator=4.4.1=py_0
- defusedxml=0.6.0=py_0
- entrypoints=0.3=py36_0
- expat=2.2.6=he6710b0_0
- fontconfig=2.13.0=h9420a91_0
- freetype=2.9.1=h8a8886c_1
- glib=2.56.2=hd408876_0
- gmp=6.1.2=h6c8ec71_1
- gst-plugins-base=1.14.0=hbbd80ab_1
- gstreamer=1.14.0=hb453b48_1
- icu=58.2=h211956c_0
- importlib_metadata=1.4.0=py36_0
- intel-openmp=2019.4=243
- ipykernel=5.1.4=py36h39e3cac_0
- ipython=7.11.1=py36h39e3cac_0
- ipython_genutils=0.2.0=py36_0
- jedi=0.16.0=py36_0
- jinja2=2.10.3=py_0
- joblib=0.14.1=py_0
- jpeg=9b=habf39ab_1
- jsonschema=3.2.0=py36_0
- jupyter_client=5.3.4=py36_0
- jupyter_core=4.6.1=py36_0
- kiwisolver=1.1.0=py36he6710b0_0
- libedit=3.1.20181209=hc058e9b_0
- libffi=3.2.1=hd88cf55_4
- libgcc-ng=9.1.0=hdf63c60_0
- libgfortran-ng=7.3.0=hdf63c60_0
- libpng=1.6.37=hbc83047_0
- libsodium=1.0.16=h1bed415_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- libtiff=4.1.0=h2733197_0
- libuuid=1.0.3=h1bed415_2
- libxcb=1.13=h1bed415_1
- libxml2=2.9.9=hea5a465_1
- markupsafe=1.1.1=py36h7b6447c_0
- matplotlib=3.1.3=py36_0
- matplotlib-base=3.1.3=py36hef1b27d_0
- mistune=0.8.4=py36h7b6447c_0
- mkl=2019.4=243
- mkl-service=2.3.0=py36he904b0f_0
- mkl_fft=1.0.15=py36ha843d7b_0
- mkl_random=1.1.0=py36hd6b4f25_0
- more-itertools=8.0.2=py_0
- nbconvert=5.6.1=py36_0
- nbformat=5.0.4=py_0
- ncurses=6.1=he6710b0_1
- notebook=6.0.3=py36_0
- numpy=1.18.1=py36h4f9e942_0
- numpy-base=1.18.1=py36hde5b4d6_1
- olefile=0.46=py36_0
- openssl=1.0.2u=h7b6447c_0
- pandas=1.0.0=py36h0573a6f_0
- pandoc=2.2.3.2=0
- pandocfilters=1.4.2=py36_1
- parso=0.6.0=py_0
- pcre=8.43=he6710b0_0
- pexpect=4.8.0=py36_0
- pickleshare=0.7.5=py36_0
- pillow=7.0.0=py36hb39fc2d_0
- pip=20.0.2=py36_1
- prometheus_client=0.7.1=py_0
- prompt_toolkit=3.0.3=py_0
- psutil=5.6.7=py36h7b6447c_0
- ptyprocess=0.6.0=py36_0
- pygments=2.5.2=py_0
- pyparsing=2.4.6=py_0
- pyqt=5.9.2=py36h22d08a2_1
- pyrsistent=0.15.7=py36h7b6447c_0
- python=3.6.6=h6e4f718_2
- python-dateutil=2.8.1=py_0
- pytz=2019.3=py_0
- pyyaml=5.2=py36h7b6447c_0
- pyzmq=18.1.0=py36he6710b0_0
- qt=5.9.6=h8703b6f_2
- qtpy=1.9.0=py_0
- readline=7.0=h7b6447c_5
- requests=2.14.2=py36_0
- scikit-learn=0.22.1=py36hd81dba3_0
- scipy=1.4.1=py36h0b6359f_0
- seaborn=0.10.0=py_0
- send2trash=1.5.0=py36_0
- setuptools=45.1.0=py36_0
- sip=4.19.13=py36he6710b0_0
- six=1.14.0=py36_0
- sqlite=3.30.1=h7b6447c_0
- terminado=0.8.3=py36_0
- testpath=0.4.4=py_0
- tk=8.6.8=hbc83047_0
- tornado=6.0.3=py36h7b6447c_0
- traitlets=4.3.3=py36_0
- wcwidth=0.1.7=py36_0
- webencodings=0.5.1=py36_1
- wheel=0.34.1=py36_0
- xz=5.2.4=h14c3975_4
- yaml=0.1.7=h96e3832_1
- zeromq=4.3.1=he6710b0_3
- zipp=0.6.0=py_0
- zlib=1.2.11=h7b6447c_3
- zstd=1.3.7=h0b5b093_0
Issue Analytics
- State:
- Created 4 years ago
- Comments:12 (8 by maintainers)
Top GitHub Comments
Ok so the number of classes is not the issue.
How do you extract input features from your text data? Using a TFIDF vectorizer or similar? How many features are generated if so?
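The feature count matters because a text feature matrix that is small in sparse form can become very large once it is converted to a dense array. As a rough back-of-the-envelope sketch (the shape below is hypothetical, not taken from the issue, and `dense_size_gib` is an illustrative helper), the dense footprint can be estimated from the matrix shape alone:

```python
def dense_size_gib(n_samples, n_features, itemsize=8):
    """Size in GiB of a dense n_samples x n_features array.

    itemsize is the number of bytes per element (8 for float64, 1 for uint8).
    """
    return n_samples * n_features * itemsize / 2**30


# Hypothetical TF-IDF output shape; substitute features.shape from real data.
n_samples, n_features = 100_000, 50_000

print(f"float64: {dense_size_gib(n_samples, n_features):.1f} GiB")     # 37.3 GiB
print(f"uint8:   {dense_size_gib(n_samples, n_features, 1):.1f} GiB")  # 4.7 GiB
```

Calling `features.shape` on the actual matrix gives the numbers to plug in. If the estimate is close to or above the machine's RAM, a freeze like the one reported would not be surprising.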
There is no support for sparse input data yet in HistGradientBoostingClassifier, so I suspect that the binning of a sparse matrix of text features might cause a memory explosion.
We recently fixed #18152 in master and I suspect that this was the cause of your problem @allanchua101. Can you please confirm by building scikit-learn from the current master source code or by trying our nightly builds:
https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds
Let me close this for now. Please feel free to open a new issue with updated memory usage numbers if you still observe large memory usage with the current master on your data.
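Separately from the fix mentioned above, the lack of sparse support discussed earlier in this thread suggests keeping the dense matrix small before it reaches the estimator. The sketch below is one possible workaround, not something the maintainers proposed here: it swaps in `TruncatedSVD` to reduce TF-IDF features to a few dense components. The toy corpus, the labels, and `n_components=2` are illustrative assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

try:
    # Needed on the scikit-learn releases where the estimator was experimental.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
except ImportError:
    pass
from sklearn.ensemble import HistGradientBoostingClassifier

# Toy corpus and labels, for illustration only.
docs = [
    "cheap pills buy now", "limited offer buy cheap", "win money now",
    "cheap offer win prize", "meeting agenda for monday",
    "project status report", "monday project meeting", "agenda and status notes",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = make_pipeline(
    TfidfVectorizer(),                             # sparse text features
    TruncatedSVD(n_components=2, random_state=0),  # dense, low-dimensional output
    HistGradientBoostingClassifier(max_iter=10, min_samples_leaf=1, random_state=0),
)
pipeline.fit(docs, labels)

# The estimator only ever sees an array of shape (n_samples, n_components).
reduced = pipeline[:-1].transform(docs)
predictions = pipeline.predict(docs)
print(reduced.shape, predictions.shape)
```

On real data the number of components would need tuning, and the reduction can cost accuracy, so this trades some model quality for a bounded memory footprint.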