LogisticRegression memory consumption goes crazy on 0.22+

Describe the bug

Starting with scikit-learn 0.22, LogisticRegression consumes far more RAM than it did on 0.21.3 for the same code and data.

Steps/Code to Reproduce

import io

import pandas as pd
import requests
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle


# Download the Stack Overflow dataset used for the reproduction.
url = "https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv"
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))

# Drop rows without a tag, keep a reproducible 50% sample, and shuffle it.
df = df[pd.notnull(df['tags'])]
df = df.sample(frac=0.5, random_state=99).reset_index(drop=True)
df = shuffle(df, random_state=22)
df = df.reset_index(drop=True)
# Encode each tag as an integer class label.
df['class_label'] = df['tags'].factorize()[0]

df_train, df_test = train_test_split(df, test_size=0.2, random_state=40)

X_train = df_train["post"].tolist()
X_test = df_test["post"].tolist()
y_train = df_train["class_label"].tolist()
y_test = df_test["class_label"].tolist()

# Binary bag-of-words features over unigrams, bigrams, and trigrams.
vectorizer = CountVectorizer(
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 3),
    stop_words='english',
    binary=True,
)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

# No solver is passed, so the installed version's default solver is used.
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(train_vectors, y_train)

Expected Results

Memory consumption on 0.22+ should stay roughly the same as on 0.21.3.

Actual Results

0.22+ behavior (tried 0.22.0, 0.22.1, 0.22.2.post1):

When run inside a container with limited memory (1-2 GB), the process is killed by the OOM killer.

Locally, top -o mem shows memory consumption growing to 9 GB and still increasing.

0.21.3 behavior:

Everything works fine within a 1GB container.

Locally, top -o mem never shows more than 1 GB of memory consumption.
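
One way to put numbers on the top observations is to record the process's peak resident set size around the fit call. The following is a minimal sketch using only the standard library; `peak_rss_mib` and `run_and_report` are hypothetical helper names, not part of scikit-learn, and the `resource` module is only available on Unix-like systems.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_and_report(func, *args, **kwargs):
    """Call func and return (result, peak MiB before, peak MiB after)."""
    before = peak_rss_mib()
    result = func(*args, **kwargs)
    after = peak_rss_mib()
    return result, before, after


if __name__ == "__main__":
    # Stand-in workload; in the reproduction script this would be
    # run_and_report(logreg.fit, train_vectors, y_train).
    _, before, after = run_and_report(lambda: [0] * 10_000_000)
    print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

Because `ru_maxrss` is a high-water mark for the whole process, the "after" value also includes the memory used by loading and vectorizing the data; running the script once per scikit-learn version and comparing the "after" values is what makes the numbers comparable.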

Versions

System:
    python: 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
    executable: /usr/bin/python3
    machine: Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
    pip: 19.3.1
    setuptools: 44.0.0
    sklearn: 0.22.1
    numpy: 1.18.1
    scipy: 1.4.1
    Cython: None
    pandas: 0.25.3
    matplotlib: 3.1.2
    joblib: 0.14.1

Built with OpenMP: True

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 18 (17 by maintainers)

Top GitHub Comments

2 reactions
lorentzenchr commented, Jun 9, 2022

If this memory consumption is a problem in the lbfgs solver of scipy, should we open an issue upstream in scipy?
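
For context on the lbfgs question: scikit-learn 0.22 changed the default `solver` of `LogisticRegression` from `liblinear` to `lbfgs`, so the reproduction script above runs a different solver on 0.22+ than on 0.21.3. The following back-of-the-envelope sketch estimates how much memory the L-BFGS correction history alone could need for a dense multinomial coefficient matrix. It is an assumption-laden estimate, not a measurement: `lbfgs_history_bytes` is a hypothetical helper, `maxcor=10` is the default number of correction pairs in SciPy's L-BFGS-B, and the feature and class counts in the example are made-up placeholders rather than values taken from the issue.

```python
def lbfgs_history_bytes(n_features, n_classes, maxcor=10,
                        fit_intercept=True, itemsize=8):
    """Estimate the bytes held by the L-BFGS correction history.

    Assumes a dense float64 parameter vector with one weight per
    (class, feature) pair, plus one intercept per class, and two
    history vectors (step and gradient difference) per correction pair.
    Other buffers (gradients, workspace, the data itself) are ignored.
    """
    n_params = n_classes * (n_features + (1 if fit_intercept else 0))
    return 2 * maxcor * n_params * itemsize


if __name__ == "__main__":
    # Hypothetical sizes: word 1- to 3-grams can easily produce millions
    # of features; the class count here is a placeholder.
    estimate = lbfgs_history_bytes(n_features=5_000_000, n_classes=20)
    print(f"estimated history size: {estimate / 1e9:.1f} GB")
```

If the estimate for the real vocabulary size lands in the multi-gigabyte range, passing `solver='liblinear'` explicitly on 0.22.x is one thing to try when checking whether the solver change explains the difference; whether that restores the 0.21.3 memory profile is not confirmed in this thread.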

2 reactions
rubywerman commented, Jul 1, 2020

@rth I’ll look into this!
