Memory leak in fit of Logistic Regression using KFold

Describe the bug

Memory appears to be leaked during the fit (training) phase of LogisticRegression. To reproduce, I generate a dataset with make_classification and split it with KFold.split.

Steps/Code to Reproduce

Example:

import numpy as np
import tracemalloc

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

def gen_clsf_data():
    data, label = make_classification(
        n_samples=5, n_features=5, random_state=777)
    return data, label, \
        data.size * data.dtype.itemsize + label.size * label.dtype.itemsize


x, y, data_memory_size = gen_clsf_data()
x = np.ascontiguousarray(x)

kf = KFold(n_splits=2)

# Trace memory allocations made while each fold is fitted.
tracemalloc.start()
for train_index, test_index in kf.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Currently traced memory just before and just after fit().
    mem_before, _ = tracemalloc.get_traced_memory()
    alg = LogisticRegression()
    alg.fit(x_train, y_train)
    mem_after, _ = tracemalloc.get_traced_memory()

    # Fail if the traced memory grew by 25% of the input data size or more.
    mem_diff = mem_after - mem_before
    assert mem_diff < 0.25 * data_memory_size, \
        'Size of extra allocated memory is greater than 25% of input data:' \
        f'\n\tInput data size: {data_memory_size} bytes' \
        f'\n\tExtra allocated memory size: {mem_diff} bytes' \
        f' / {round(mem_diff / data_memory_size * 100, 2)} %'
tracemalloc.stop()

Expected Results

I expect the extra memory allocated during fit to be less than 25% of the input data size.
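A single before/after reading around the first fit call cannot, on its own, tell a leak apart from one-time allocations (for example lazily imported modules or caches filled on first use). A rough way to separate the two, sketched here with only the standard library, is to warm the call up once and then record the growth of every repeated call. The helper name growth_per_call and the leaky/steady stand-in workloads are hypothetical; they are not part of the original report or of scikit-learn.

```python
import tracemalloc

_retained = []  # hypothetical global that keeps objects alive (simulates a leak)


def leaky():
    """Stand-in workload that retains about 10 kB on every call."""
    _retained.append(bytes(10_000))


def steady():
    """Stand-in workload whose temporary buffer is released on return."""
    data = bytes(10_000)
    return len(data)


def growth_per_call(fn, repeats=5, warmup=1):
    """Return the traced-memory growth in bytes for each of `repeats` calls."""
    for _ in range(warmup):
        fn()  # absorb one-time allocations before tracing starts
    tracemalloc.start()
    try:
        growth = []
        for _ in range(repeats):
            before, _peak = tracemalloc.get_traced_memory()
            fn()
            after, _peak = tracemalloc.get_traced_memory()
            growth.append(after - before)
    finally:
        tracemalloc.stop()
    return growth


# A real leak keeps growing on every call; a one-time cost does not.
print("leaky :", growth_per_call(leaky))
print("steady:", growth_per_call(steady))
```

In the reproduction script, fn could be something like lambda: LogisticRegression().fit(x_train, y_train). Growth that stays near zero after the warm-up would point to one-time allocations rather than a leak; growth that repeats on every call would be worth investigating further.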

Actual Results

Input data size: 240 bytes
Extra allocated memory size: 103117 bytes / 42965.42 %
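The total alone does not say where the extra bytes were allocated. As a minimal sketch, tracemalloc snapshots taken before and after a call can be compared to list the source lines with the largest change. The helper name top_allocations and the workload function are hypothetical stand-ins, not something from the original report.

```python
import tracemalloc


def top_allocations(fn, limit=5):
    """Run fn() and return the `limit` source lines whose traced memory changed most."""
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = fn()  # kept alive so its memory is still present in `after`
        after = tracemalloc.take_snapshot()
    finally:
        if started_here:
            tracemalloc.stop()
    del result
    # compare_to returns StatisticDiff entries, largest absolute change first.
    return after.compare_to(before, "lineno")[:limit]


def workload():
    """Hypothetical stand-in for the call under investigation."""
    return [bytes(1000) for _ in range(100)]


for stat in top_allocations(workload):
    print(stat)
```

Passing lambda: LogisticRegression().fit(x_train, y_train) as fn would apply the same comparison to the fit call; each printed entry names a file and line together with the change in size and block count.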

Versions

scikit-learn == 0.24.2

pip install scikit-learn

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

OnlyDeniko commented, Jun 17, 2021 (1 reaction)

Thanks for the answer, that makes it clear to me. I think the issue can be closed.

thomasjpfan commented, Jun 20, 2021 (0 reactions)

Thank you for the help with debugging this memory issue, @i-aki-y!

