GridSearchCV.fit() causes memory leak
Description
When I run GridSearchCV.fit(), it causes a memory leak that eventually consumes all physical memory (16 GB) and causes my Python process to be killed.
Steps/Code to Reproduce
- Create a Python file that contains the code below
- Download this csv file Churn_Modelling.csv.txt, remove the .txt extension (I had to add it to upload the file to GitHub), and put it in the same folder as the Python file from the first step
- Run the code with python
- Watch the Python process’s memory consumption continually increase and never decrease
import numpy as np
import pandas as pd
# Import data set
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
# Encode categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
ct = ColumnTransformer(
    [
        ('Country', OneHotEncoder(), [1]),
        ('Gender', OrdinalEncoder(), [2])
    ],
    remainder='passthrough')
X = ct.fit_transform(X)
X = X[:, 1:]
# Split dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Build ANN
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import GridSearchCV
def build_classifier(optimizer):
    classifier = Sequential()
    classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu', input_dim=11))
    classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))
    classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
    classifier.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return classifier
classifier = KerasClassifier(build_fn=build_classifier)
# Tune the ANN
parameters = {'batch_size': [25, 32], 'epochs': [100, 500], 'optimizer': ['adam', 'rmsprop']}
grid_search = GridSearchCV(estimator=classifier, param_grid=parameters, scoring='accuracy', cv=10)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_
Expected Results
Code runs to completion without leaking memory
Actual Results
The call to GridSearchCV.fit() begins running the epochs and consumes all available RAM (16 GB total), causing the Python process running the code to be killed before the code can finish.
Versions
System:
    python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21) [GCC 7.3.0]
    executable: /home/user/miniconda3/envs/super_ds/bin/python
    machine: Linux-5.2.0-2-amd64-x86_64-with-debian-bullseye-sid
Python deps:
    pip: 19.2.3
    setuptools: 41.2.0
    sklearn: 0.21.3
    numpy: 1.17.2
    scipy: 1.3.1
    Cython: None
    pandas: 0.25.1
Keras:
    keras: 2.3.0
    keras-applications: 1.0.8
    keras-preprocessing: 1.1.0
Comments: 6 (3 by maintainers)
@rth is correct. This issue is caused by Tensorflow. After changing the Keras backend to Theano, my original code using GridSearchCV ran with the process’s memory consumption staying at 368 MB. Likewise, the memory consumption for the code with nested loops stayed under 329 MB. After seeing this, I did some investigation into Tensorflow and found that the memory consumption I was seeing is not an unintentional memory leak; it is part of Tensorflow’s computational model. Tensorflow is lazily evaluated: it builds a dataflow graph that is then evaluated to produce an execution plan. When a model is created, Tensorflow adds the model’s nodes to the dataflow graph. So even though I deleted the references to model objects in my Python code, the models were still stored in the dataflow graph. I read several sources that said the memory could be freed by importing backend from keras and calling clear_session, like this:
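from keras import backend
backend.clear_session()  # supposed to discard the current Tensorflow graph and free its nodes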
However, this had no effect, and I have seen several reports that clear_session() is not working for other people as well. Even if clear_session() were working, it would not help when using GridSearchCV unless GridSearchCV methods had an option to call clear_session() at certain times. So the final solution is to use a backend other than Tensorflow for Keras when tuning hyperparameters with GridSearchCV.
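For reference, Keras 2.x picks its backend from the KERAS_BACKEND environment variable (or the "backend" field in ~/.keras/keras.json), so switching to Theano needs no changes to the model code; a minimal sketch:

import os
os.environ['KERAS_BACKEND'] = 'theano'  # must be set before the first import of keras
import keras  # prints "Using Theano backend."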
Could be an issue in Tensorflow (used by Keras). In general, if Tensorflow allocates some memory in C++ and doesn’t release it, the Python garbage collector can’t do much about it.
One thing to try could be to change the Keras backend to CNTK or Theano and see if it’s still reproducible. In any case it’s likely that this is unrelated to scikit-learn.
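To illustrate the point about clear_session() above: a hand-rolled search loop can clear the graph between candidates, which GridSearchCV itself offers no hook for. A minimal sketch, assuming the build_classifier function and data from the report and skipping cross-validation and scoring:

from itertools import product
from keras import backend

param_grid = {'batch_size': [25, 32], 'epochs': [100, 500], 'optimizer': ['adam', 'rmsprop']}
for batch_size, epochs, optimizer in product(*param_grid.values()):
    model = build_classifier(optimizer)
    model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=0)
    # ...evaluate the candidate here...
    del model
    backend.clear_session()  # only frees memory if clear_session actually works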