Memory leak in fit of Logistic Regression using KFold
Describe the bug
Some memory appears to be leaked during the training phase of LogisticRegression. The dataset is generated with make_classification and then split with KFold.split.
Steps/Code to Reproduce
Example:
```python
import numpy as np
import tracemalloc
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

def gen_clsf_data():
    data, label = make_classification(
        n_samples=5, n_features=5, random_state=777)
    return data, label, \
        data.size * data.dtype.itemsize + label.size * label.dtype.itemsize

x, y, data_memory_size = gen_clsf_data()
x = np.ascontiguousarray(x)
kf = KFold(n_splits=2)
tracemalloc.start()
for train_index, test_index in kf.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    mem_before, _ = tracemalloc.get_traced_memory()
    alg = LogisticRegression()
    alg.fit(x_train, y_train)
    mem_after, _ = tracemalloc.get_traced_memory()
    mem_diff = mem_after - mem_before
    assert mem_diff < 0.25 * data_memory_size, \
        'Size of extra allocated memory is greater than 25% of input data:' \
        f'\n\tInput data size: {data_memory_size} bytes' \
        f'\n\tExtra allocated memory size: {mem_diff} bytes' \
        f' / {round(mem_diff / data_memory_size * 100, 2)} %'
tracemalloc.stop()
```
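To narrow down where the extra allocations come from, tracemalloc snapshots can be diffed around a single fit call. This is a diagnostic sketch, not part of the original report; the top entries typically point at the modules doing the allocating:

```python
import tracemalloc
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

x, y = make_classification(n_samples=5, n_features=5, random_state=777)

tracemalloc.start()
snap_before = tracemalloc.take_snapshot()
LogisticRegression().fit(x, y)
snap_after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Rank call sites by net allocation between the two snapshots.
stats = snap_after.compare_to(snap_before, 'lineno')
for stat in stats[:5]:
    print(stat)
```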
Expected Results
Extra memory allocated during fit should be less than 25% of the input dataset's size.
Actual Results
Input data size: 240 bytes
Extra allocated memory size: 103117 bytes / 42965.42 %
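A useful check, assumed here rather than taken from the original report, is whether the growth is per-fit or one-time: fitting repeatedly and recording the traced-memory delta each time distinguishes a genuine leak from one-time caching or lazy imports.

```python
import tracemalloc
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

x, y = make_classification(n_samples=5, n_features=5, random_state=777)

tracemalloc.start()
growth = []
for _ in range(5):
    before, _ = tracemalloc.get_traced_memory()
    LogisticRegression().fit(x, y)
    after, _ = tracemalloc.get_traced_memory()
    growth.append(after - before)
tracemalloc.stop()

# If only the first delta is large and later ones are near zero,
# the overhead is one-time allocation rather than a per-fit leak.
print(growth)
```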
Versions
scikit-learn == 0.24.2 (installed via pip install scikit-learn)
Issue Analytics
- State:
- Created 2 years ago
- Comments:10 (5 by maintainers)
Top Results From Across the Web
How to Avoid Data Leakage When Performing Data Preparation
In this section, we will evaluate a logistic regression model using k-fold cross-validation on a synthetic binary classification dataset ...
10. Common pitfalls and recommended practices - Scikit-learn
Data leakage occurs when information that would not be available at prediction time is used when building the model. This results in overly...
Scikit Learn Logistic Regression Memory Leak
I'm curious if anyone else has run into this. I have a data set with about 350k samples, each with 4k sparse features....
From Logistic Regression to CNN | Kaggle
def scores_cv(model): kf = KFold(5, shuffle = True, ... This can be caused by a too short worker timeout or by a memory...
Machine Learning (Natural Language Processing - NLP)
In this article, we are going to train a logistic regression model for document classification. ... Next, using 5-fold stratified cross-validation, ...
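The data-leakage pitfall mentioned in the scikit-learn result above is distinct from the memory issue in this report, but the recommended fix is worth illustrating: wrapping preprocessing and the model in a Pipeline so fold statistics never leak. A minimal sketch, assuming StandardScaler as the preprocessing step:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = make_classification(n_samples=100, n_features=5, random_state=777)

# The Pipeline refits the scaler on each training fold only, so no
# test-fold statistics leak into model training.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, x, y, cv=KFold(n_splits=5))
print(scores.mean())
```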
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks for the answer. It became clear to me. It seems to me that the issue can be closed
Thank you for the help with debugging this memory issue @i-aki-y !