
HistGradientBoostingClassifier causing memory leak in Ubuntu

See original GitHub issue

To perform model selection, I'm running K-fold cross-validation across 11 models, and HistGradientBoostingClassifier causes a memory leak that freezes my machine 😿

import pandas as pd

from sklearn.model_selection import cross_val_score
from sklearn.experimental import enable_hist_gradient_boosting 
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

models = [
    #DecisionTreeClassifier(max_depth=16),
    #MLPClassifier(alpha=1, max_iter=1000),
    #AdaBoostClassifier(),
    #KNeighborsClassifier(3),
    #GaussianNB(),
    HistGradientBoostingClassifier(random_state=0, min_samples_leaf=20, loss='categorical_crossentropy'),
    #GradientBoostingClassifier(random_state=0),
    #MultinomialNB(),
    #LogisticRegression(random_state=0, solver='lbfgs', multi_class='auto'),
    #RandomForestClassifier(n_estimators=500, max_depth=32, random_state=0),
    #SVC(kernel='linear', probability=True, gamma='auto', random_state=0),
]

CV = 5

# `features` and `labels` are assumed to be defined earlier (the vectorized
# text data and its targets); they are not shown in the report.
entries = []
for model in models:
  model_name = model.__class__.__name__
  accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
  for fold_idx, accuracy in enumerate(accuracies):
    print(f'{model_name} - {fold_idx} {accuracy:.4f}')
    entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

OS Description

Distributor ID: Ubuntu
Description:    Ubuntu 19.10
Release:        19.10

Python Ecosystem Details

conda 4.8.1
Python 3.6.6 :: Anaconda, Inc.

Conda Environment Details

channels:
  - anaconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - anaconda-client=1.7.2=py36_0
  - anaconda-navigator=1.9.7=py36_0
  - attrs=19.3.0=py_0
  - backcall=0.1.0=py36_0
  - blas=1.0=mkl
  - bleach=3.1.0=py36_0
  - ca-certificates=2020.1.1=0
  - certifi=2019.11.28=py36_0
  - chardet=3.0.4=py36_1003
  - clyent=1.2.2=py36_1
  - cycler=0.10.0=py36_0
  - dbus=1.13.12=h746ee38_0
  - decorator=4.4.1=py_0
  - defusedxml=0.6.0=py_0
  - entrypoints=0.3=py36_0
  - expat=2.2.6=he6710b0_0
  - fontconfig=2.13.0=h9420a91_0
  - freetype=2.9.1=h8a8886c_1
  - glib=2.56.2=hd408876_0
  - gmp=6.1.2=h6c8ec71_1
  - gst-plugins-base=1.14.0=hbbd80ab_1
  - gstreamer=1.14.0=hb453b48_1
  - icu=58.2=h211956c_0
  - importlib_metadata=1.4.0=py36_0
  - intel-openmp=2019.4=243
  - ipykernel=5.1.4=py36h39e3cac_0
  - ipython=7.11.1=py36h39e3cac_0
  - ipython_genutils=0.2.0=py36_0
  - jedi=0.16.0=py36_0
  - jinja2=2.10.3=py_0
  - joblib=0.14.1=py_0
  - jpeg=9b=habf39ab_1
  - jsonschema=3.2.0=py36_0
  - jupyter_client=5.3.4=py36_0
  - jupyter_core=4.6.1=py36_0
  - kiwisolver=1.1.0=py36he6710b0_0
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libpng=1.6.37=hbc83047_0
  - libsodium=1.0.16=h1bed415_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libtiff=4.1.0=h2733197_0
  - libuuid=1.0.3=h1bed415_2
  - libxcb=1.13=h1bed415_1
  - libxml2=2.9.9=hea5a465_1
  - markupsafe=1.1.1=py36h7b6447c_0
  - matplotlib=3.1.3=py36_0
  - matplotlib-base=3.1.3=py36hef1b27d_0
  - mistune=0.8.4=py36h7b6447c_0
  - mkl=2019.4=243
  - mkl-service=2.3.0=py36he904b0f_0
  - mkl_fft=1.0.15=py36ha843d7b_0
  - mkl_random=1.1.0=py36hd6b4f25_0
  - more-itertools=8.0.2=py_0
  - nbconvert=5.6.1=py36_0
  - nbformat=5.0.4=py_0
  - ncurses=6.1=he6710b0_1
  - notebook=6.0.3=py36_0
  - numpy=1.18.1=py36h4f9e942_0
  - numpy-base=1.18.1=py36hde5b4d6_1
  - olefile=0.46=py36_0
  - openssl=1.0.2u=h7b6447c_0
  - pandas=1.0.0=py36h0573a6f_0
  - pandoc=2.2.3.2=0
  - pandocfilters=1.4.2=py36_1
  - parso=0.6.0=py_0
  - pcre=8.43=he6710b0_0
  - pexpect=4.8.0=py36_0
  - pickleshare=0.7.5=py36_0
  - pillow=7.0.0=py36hb39fc2d_0
  - pip=20.0.2=py36_1
  - prometheus_client=0.7.1=py_0
  - prompt_toolkit=3.0.3=py_0
  - psutil=5.6.7=py36h7b6447c_0
  - ptyprocess=0.6.0=py36_0
  - pygments=2.5.2=py_0
  - pyparsing=2.4.6=py_0
  - pyqt=5.9.2=py36h22d08a2_1
  - pyrsistent=0.15.7=py36h7b6447c_0
  - python=3.6.6=h6e4f718_2
  - python-dateutil=2.8.1=py_0
  - pytz=2019.3=py_0
  - pyyaml=5.2=py36h7b6447c_0
  - pyzmq=18.1.0=py36he6710b0_0
  - qt=5.9.6=h8703b6f_2
  - qtpy=1.9.0=py_0
  - readline=7.0=h7b6447c_5
  - requests=2.14.2=py36_0
  - scikit-learn=0.22.1=py36hd81dba3_0
  - scipy=1.4.1=py36h0b6359f_0
  - seaborn=0.10.0=py_0
  - send2trash=1.5.0=py36_0
  - setuptools=45.1.0=py36_0
  - sip=4.19.13=py36he6710b0_0
  - six=1.14.0=py36_0
  - sqlite=3.30.1=h7b6447c_0
  - terminado=0.8.3=py36_0
  - testpath=0.4.4=py_0
  - tk=8.6.8=hbc83047_0
  - tornado=6.0.3=py36h7b6447c_0
  - traitlets=4.3.3=py36_0
  - wcwidth=0.1.7=py36_0
  - webencodings=0.5.1=py36_1
  - wheel=0.34.1=py36_0
  - xz=5.2.4=h14c3975_4
  - yaml=0.1.7=h96e3832_1
  - zeromq=4.3.1=he6710b0_3
  - zipp=0.6.0=py_0
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.3.7=h0b5b093_0

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 12 (8 by maintainers)

Top GitHub Comments

1 reaction
ogrisel commented, Feb 6, 2020

Ok so the number of classes is not the issue.

How do you extract input features from your text data? Using a TF-IDF vectorizer or similar? If so, how many features are generated?

HistGradientBoostingClassifier does not support sparse input data yet, so I suspect that binning a sparse matrix of text features might cause a memory explosion.
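Following this diagnosis, one workaround while sparse input is unsupported is to project the sparse TF-IDF matrix down to a small dense matrix before boosting. This is only a sketch under assumptions (the reporter's features may not come from TF-IDF, and the corpus below is a stand-in), not the original pipeline:

```python
# Sketch: turn a large sparse TF-IDF matrix into a small dense one with
# TruncatedSVD, so HistGradientBoostingClassifier never sees sparse input.
# The corpus is a placeholder; the reporter's real features are not shown.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "memory leak on ubuntu",
    "gradient boosting is fast",
    "cross validation across models",
    "sparse matrices save memory",
]
tfidf = TfidfVectorizer().fit_transform(corpus)  # sparse CSR matrix
print(type(tfidf).__name__, tfidf.shape)

svd = TruncatedSVD(n_components=2, random_state=0)
dense_features = svd.fit_transform(tfidf)  # dense (n_samples, 2) ndarray
print(dense_features.shape)
```

TruncatedSVD operates on sparse input directly, so the only dense array ever allocated is the small projected one.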

0 reactions
ogrisel commented, Sep 4, 2020

We recently fixed #18152 in master, and I suspect it was the cause of your problem @allanchua101. Could you please confirm by building scikit-learn from the current master source code or by trying our nightly builds:

https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds

Let me close this for now. Please feel free to open a new issue with updated memory usage numbers if you still observe large memory usage with the current master on your data.
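For reporting updated memory usage numbers, peak resident memory can be captured with the standard library alone. This is a rough sketch, not a full profiler: the `resource` module is POSIX-only, and `ru_maxrss` units differ by platform (kilobytes on Linux, bytes on macOS).

```python
# Sketch: report peak resident set size (RSS) before and after a workload,
# using only the standard library. POSIX-only; ru_maxrss is in kilobytes
# on Linux. The workload below is a placeholder for a real call such as
# cross_val_score(model, features, labels, cv=5).
import resource

def peak_rss_mb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

before = peak_rss_mb()
_ = [i ** 2 for i in range(1_000_000)]  # placeholder workload
after = peak_rss_mb()
print(f"peak RSS: {before:.1f} MB -> {after:.1f} MB")
```

Because `ru_maxrss` is a high-water mark, `after` can only stay equal or grow; a large jump across the training call is the number worth reporting.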
