Memory leak in fit of Logistic Regression using KFold
Describe the bug
Some memory appears to be leaked during the training phase of LogisticRegression. The dataset is generated with make_classification and then split with KFold.split.
Steps/Code to Reproduce
Example:
```python
import numpy as np
import tracemalloc
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

def gen_clsf_data():
    data, label = make_classification(
        n_samples=5, n_features=5, random_state=777)
    return data, label, \
        data.size * data.dtype.itemsize + label.size * label.dtype.itemsize

x, y, data_memory_size = gen_clsf_data()
x = np.ascontiguousarray(x)
kf = KFold(n_splits=2)
tracemalloc.start()
for train_index, test_index in kf.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    mem_before, _ = tracemalloc.get_traced_memory()
    alg = LogisticRegression()
    alg.fit(x_train, y_train)
    mem_after, _ = tracemalloc.get_traced_memory()
    mem_diff = mem_after - mem_before
    assert mem_diff < 0.25 * data_memory_size, \
        'Size of extra allocated memory is greater than 25% of input data:' \
        f'\n\tInput data size: {data_memory_size} bytes' \
        f'\n\tExtra allocated memory size: {mem_diff} bytes' \
        f' / {round(mem_diff / data_memory_size * 100, 2)} %'
tracemalloc.stop()
```
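To narrow down where the extra allocations come from, tracemalloc snapshots can be diffed around a single fit call. This is a diagnostic sketch, not part of the original report; the top entries typically point at the modules doing the allocating:

```python
import tracemalloc
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

x, y = make_classification(n_samples=5, n_features=5, random_state=777)

tracemalloc.start()
snap_before = tracemalloc.take_snapshot()
LogisticRegression().fit(x, y)
snap_after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Rank call sites by net allocation between the two snapshots.
stats = snap_after.compare_to(snap_before, 'lineno')
for stat in stats[:5]:
    print(stat)
```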
Expected Results
Extra memory allocated during fit should be less than 25% of the input dataset's size.
Actual Results
Input data size: 240 bytes
Extra allocated memory size: 103117 bytes / 42965.42 %
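A useful check, assumed here rather than taken from the original report, is whether the growth is per-fit or one-time: fitting repeatedly and recording the traced-memory delta each time distinguishes a genuine leak from one-time caching or lazy imports.

```python
import tracemalloc
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

x, y = make_classification(n_samples=5, n_features=5, random_state=777)

tracemalloc.start()
growth = []
for _ in range(5):
    before, _ = tracemalloc.get_traced_memory()
    LogisticRegression().fit(x, y)
    after, _ = tracemalloc.get_traced_memory()
    growth.append(after - before)
tracemalloc.stop()

# If only the first delta is large and later ones are near zero,
# the overhead is one-time allocation rather than a per-fit leak.
print(growth)
```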
Versions
scikit-learn == 0.24.2 (installed via pip install scikit-learn)
Issue Analytics
- State:
- Created 2 years ago
- Comments:10 (5 by maintainers)
Top Results From Across the Web
How to Avoid Data Leakage When Performing Data Preparation
In this section, we will evaluate a logistic regression model using k-fold cross-validation on a synthetic binary classification dataset ...
10. Common pitfalls and recommended practices - Scikit-learn
Data leakage occurs when information that would not be available at prediction time is used when building the model. This results in overly...
Scikit Learn Logistic Regression Memory Leak
I'm curious if anyone else has run into this. I have a data set with about 350k samples, each with 4k sparse features....
From Logistic Regression to CNN | Kaggle
def scores_cv(model): kf = KFold(5, shuffle = True, ... This can be caused by a too short worker timeout or by a memory...
Machine Learning (Natural Language Processing - NLP)
In this article, we are going to train a logistic regression model for document classification. ... Next, using 5-fold stratified cross-validation, ...
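The data-leakage pitfall mentioned in the scikit-learn result above is distinct from the memory issue in this report, but the recommended fix is worth illustrating: wrapping preprocessing and the model in a Pipeline so fold statistics never leak. A minimal sketch, assuming StandardScaler as the preprocessing step:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = make_classification(n_samples=100, n_features=5, random_state=777)

# The Pipeline refits the scaler on each training fold only, so no
# test-fold statistics leak into model training.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, x, y, cv=KFold(n_splits=5))
print(scores.mean())
```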
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks for the answer. It became clear to me. It seems to me that the issue can be closed
Thank you for the help with debugging this memory issue @i-aki-y !