
HistGradientBoostingClassifier causing memory leak in Ubuntu

See original GitHub issue

To perform model selection, I'm running K-fold cross-validation across 11 models, and HistGradientBoostingClassifier causes a memory leak that freezes my machine 😿

import pandas as pd

from sklearn.model_selection import cross_val_score
from sklearn.experimental import enable_hist_gradient_boosting 
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

models = [
    #DecisionTreeClassifier(max_depth=16),
    #MLPClassifier(alpha=1, max_iter=1000),
    #AdaBoostClassifier(),
    #KNeighborsClassifier(3),
    #GaussianNB(),
    HistGradientBoostingClassifier(random_state=0, min_samples_leaf=20, loss='categorical_crossentropy'),
    #GradientBoostingClassifier(random_state=0),
    #MultinomialNB(),
    #LogisticRegression(random_state=0, solver='lbfgs', multi_class='auto'),
    #RandomForestClassifier(n_estimators=500, max_depth=32, random_state=0),
    #SVC(kernel='linear', probability=True, gamma='auto', random_state=0),
]

CV = 5

# `features` and `labels` are assumed to be defined earlier (the vectorized
# text data and its targets); they are not shown in the report.
entries = []
for model in models:
  model_name = model.__class__.__name__
  accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
  for fold_idx, accuracy in enumerate(accuracies):
    print(f'{model_name} - {fold_idx} {accuracy:.4f}')
    entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

OS Description

Distributor ID: Ubuntu
Description:    Ubuntu 19.10
Release:        19.10

Python Ecosystem Details

conda 4.8.1
Python 3.6.6 :: Anaconda, Inc.

Conda Environment Details

channels:
  - anaconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - anaconda-client=1.7.2=py36_0
  - anaconda-navigator=1.9.7=py36_0
  - attrs=19.3.0=py_0
  - backcall=0.1.0=py36_0
  - blas=1.0=mkl
  - bleach=3.1.0=py36_0
  - ca-certificates=2020.1.1=0
  - certifi=2019.11.28=py36_0
  - chardet=3.0.4=py36_1003
  - clyent=1.2.2=py36_1
  - cycler=0.10.0=py36_0
  - dbus=1.13.12=h746ee38_0
  - decorator=4.4.1=py_0
  - defusedxml=0.6.0=py_0
  - entrypoints=0.3=py36_0
  - expat=2.2.6=he6710b0_0
  - fontconfig=2.13.0=h9420a91_0
  - freetype=2.9.1=h8a8886c_1
  - glib=2.56.2=hd408876_0
  - gmp=6.1.2=h6c8ec71_1
  - gst-plugins-base=1.14.0=hbbd80ab_1
  - gstreamer=1.14.0=hb453b48_1
  - icu=58.2=h211956c_0
  - importlib_metadata=1.4.0=py36_0
  - intel-openmp=2019.4=243
  - ipykernel=5.1.4=py36h39e3cac_0
  - ipython=7.11.1=py36h39e3cac_0
  - ipython_genutils=0.2.0=py36_0
  - jedi=0.16.0=py36_0
  - jinja2=2.10.3=py_0
  - joblib=0.14.1=py_0
  - jpeg=9b=habf39ab_1
  - jsonschema=3.2.0=py36_0
  - jupyter_client=5.3.4=py36_0
  - jupyter_core=4.6.1=py36_0
  - kiwisolver=1.1.0=py36he6710b0_0
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libpng=1.6.37=hbc83047_0
  - libsodium=1.0.16=h1bed415_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libtiff=4.1.0=h2733197_0
  - libuuid=1.0.3=h1bed415_2
  - libxcb=1.13=h1bed415_1
  - libxml2=2.9.9=hea5a465_1
  - markupsafe=1.1.1=py36h7b6447c_0
  - matplotlib=3.1.3=py36_0
  - matplotlib-base=3.1.3=py36hef1b27d_0
  - mistune=0.8.4=py36h7b6447c_0
  - mkl=2019.4=243
  - mkl-service=2.3.0=py36he904b0f_0
  - mkl_fft=1.0.15=py36ha843d7b_0
  - mkl_random=1.1.0=py36hd6b4f25_0
  - more-itertools=8.0.2=py_0
  - nbconvert=5.6.1=py36_0
  - nbformat=5.0.4=py_0
  - ncurses=6.1=he6710b0_1
  - notebook=6.0.3=py36_0
  - numpy=1.18.1=py36h4f9e942_0
  - numpy-base=1.18.1=py36hde5b4d6_1
  - olefile=0.46=py36_0
  - openssl=1.0.2u=h7b6447c_0
  - pandas=1.0.0=py36h0573a6f_0
  - pandoc=2.2.3.2=0
  - pandocfilters=1.4.2=py36_1
  - parso=0.6.0=py_0
  - pcre=8.43=he6710b0_0
  - pexpect=4.8.0=py36_0
  - pickleshare=0.7.5=py36_0
  - pillow=7.0.0=py36hb39fc2d_0
  - pip=20.0.2=py36_1
  - prometheus_client=0.7.1=py_0
  - prompt_toolkit=3.0.3=py_0
  - psutil=5.6.7=py36h7b6447c_0
  - ptyprocess=0.6.0=py36_0
  - pygments=2.5.2=py_0
  - pyparsing=2.4.6=py_0
  - pyqt=5.9.2=py36h22d08a2_1
  - pyrsistent=0.15.7=py36h7b6447c_0
  - python=3.6.6=h6e4f718_2
  - python-dateutil=2.8.1=py_0
  - pytz=2019.3=py_0
  - pyyaml=5.2=py36h7b6447c_0
  - pyzmq=18.1.0=py36he6710b0_0
  - qt=5.9.6=h8703b6f_2
  - qtpy=1.9.0=py_0
  - readline=7.0=h7b6447c_5
  - requests=2.14.2=py36_0
  - scikit-learn=0.22.1=py36hd81dba3_0
  - scipy=1.4.1=py36h0b6359f_0
  - seaborn=0.10.0=py_0
  - send2trash=1.5.0=py36_0
  - setuptools=45.1.0=py36_0
  - sip=4.19.13=py36he6710b0_0
  - six=1.14.0=py36_0
  - sqlite=3.30.1=h7b6447c_0
  - terminado=0.8.3=py36_0
  - testpath=0.4.4=py_0
  - tk=8.6.8=hbc83047_0
  - tornado=6.0.3=py36h7b6447c_0
  - traitlets=4.3.3=py36_0
  - wcwidth=0.1.7=py36_0
  - webencodings=0.5.1=py36_1
  - wheel=0.34.1=py36_0
  - xz=5.2.4=h14c3975_4
  - yaml=0.1.7=h96e3832_1
  - zeromq=4.3.1=he6710b0_3
  - zipp=0.6.0=py_0
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.3.7=h0b5b093_0

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 12 (8 by maintainers)

Top GitHub Comments

1 reaction
ogrisel commented, Feb 6, 2020

Ok so the number of classes is not the issue.

How do you extract input features from your text data? Using a TF-IDF vectorizer or similar? If so, how many features are generated?

HistGradientBoostingClassifier does not support sparse input data yet, so I suspect that binning a sparse matrix of text features might cause a memory explosion.
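Following this diagnosis, one workaround while sparse input is unsupported is to project the sparse TF-IDF matrix down to a small dense matrix before boosting. This is only a sketch under assumptions (the reporter's features may not come from TF-IDF, and the corpus below is a stand-in), not the original pipeline:

```python
# Sketch: turn a large sparse TF-IDF matrix into a small dense one with
# TruncatedSVD, so HistGradientBoostingClassifier never sees sparse input.
# The corpus is a placeholder; the reporter's real features are not shown.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "memory leak on ubuntu",
    "gradient boosting is fast",
    "cross validation across models",
    "sparse matrices save memory",
]
tfidf = TfidfVectorizer().fit_transform(corpus)  # sparse CSR matrix
print(type(tfidf).__name__, tfidf.shape)

svd = TruncatedSVD(n_components=2, random_state=0)
dense_features = svd.fit_transform(tfidf)  # dense (n_samples, 2) ndarray
print(dense_features.shape)
```

TruncatedSVD operates on sparse input directly, so the only dense array ever allocated is the small projected one.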

0 reactions
ogrisel commented, Sep 4, 2020

We recently fixed #18152 in master, and I suspect it was the cause of your problem @allanchua101. Could you please confirm by building scikit-learn from the current master source code or by trying our nightly builds:

https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds

Let me close this for now. Please feel free to open a new issue with updated memory usage numbers if you still observe large memory usage with the current master on your data.
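For reporting updated memory usage numbers, peak resident memory can be captured with the standard library alone. This is a rough sketch, not a full profiler: the `resource` module is POSIX-only, and `ru_maxrss` units differ by platform (kilobytes on Linux, bytes on macOS).

```python
# Sketch: report peak resident set size (RSS) before and after a workload,
# using only the standard library. POSIX-only; ru_maxrss is in kilobytes
# on Linux. The workload below is a placeholder for a real call such as
# cross_val_score(model, features, labels, cv=5).
import resource

def peak_rss_mb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

before = peak_rss_mb()
_ = [i ** 2 for i in range(1_000_000)]  # placeholder workload
after = peak_rss_mb()
print(f"peak RSS: {before:.1f} MB -> {after:.1f} MB")
```

Because `ru_maxrss` is a high-water mark, `after` can only stay equal or grow; a large jump across the training call is the number worth reporting.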
