LogisticRegression memory consumption goes crazy on 0.22+
Describe the bug
LogisticRegression started consuming excessive amounts of RAM on scikit-learn 0.22+.
Steps/Code to Reproduce
import io
import requests
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle

url = "https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv"  # .csv file location
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))
df = df[pd.notnull(df['tags'])]
df = df.sample(frac=0.5, random_state=99).reset_index(drop=True)
df = shuffle(df, random_state=22)
df = df.reset_index(drop=True)
df['class_label'] = df['tags'].factorize()[0]
df_train, df_test = train_test_split(df, test_size=0.2, random_state=40)
X_train = df_train["post"].tolist()
X_test = df_test["post"].tolist()
y_train = df_train["class_label"].tolist()
y_test = df_test["class_label"].tolist()
vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                             ngram_range=(1, 3), stop_words='english', binary=True)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(train_vectors, y_train)
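A lighter-weight way to record the growth than watching top is to read the process's peak resident set size around the fit call. A minimal sketch using only the standard-library resource module (Unix only; the list allocation below is a stand-in for the fit call, not part of the original report):

```python
import resource

def peak_rss_mb():
    # ru_maxrss is reported in KiB on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

before = peak_rss_mb()
# Replace this stand-in allocation with: logreg.fit(train_vectors, y_train)
data = [0] * 10_000_000
after = peak_rss_mb()
print(f"peak RSS grew by roughly {after - before:.0f} MiB")
```

Because ru_maxrss is a high-water mark, the "after" reading captures the peak reached during the fit even if memory is freed afterwards.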
Expected Results
On 0.22+, memory consumption should stay roughly the same as on 0.21.3.
Actual Results
0.22+ behavior (tried 0.22.0, 0.22.1, 0.22.2.post1):
If run inside a container with limited memory (1-2 GB), the code crashes (killed by the OOM killer).
Locally, top -o mem shows memory consumption growing to 9 GB and continuing to increase.
0.21.3 behavior:
Everything works fine inside a 1 GB container.
Locally, top -o mem never shows memory consumption exceeding 1 GB.
Versions
System:
    python: 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
executable: /usr/bin/python3
   machine: Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
       pip: 19.3.1
setuptools: 44.0.0
   sklearn: 0.22.1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 0.25.3
matplotlib: 3.1.2
    joblib: 0.14.1
Built with OpenMP: True
Issue Analytics
- Created 3 years ago
- Comments: 18 (17 by maintainers)
Top GitHub Comments
If this memory consumption problem lies in scipy's lbfgs solver, should we open an issue upstream in scipy?
@rth I’ll look into this!
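Since the lbfgs solver is the suspect, and the default LogisticRegression solver changed from liblinear to lbfgs in 0.22, one quick check is whether pinning solver='liblinear' restores the 0.21-style memory behavior. A sketch of the comparison (the synthetic dataset is for illustration and is not from the original report):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

scores = {}
for solver in ("liblinear", "lbfgs"):
    # Same C as the reproduction; max_iter raised so lbfgs can converge
    clf = LogisticRegression(solver=solver, C=1e5, max_iter=1000)
    clf.fit(X, y)
    scores[solver] = clf.score(X, y)
    print(solver, scores[solver])
```

Running the original reproduction with solver='liblinear' under the same container limit would separate a regression in the solver choice from a regression elsewhere in LogisticRegression.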