SGDRegressor gets poor fit with sparse matrix

If a sparse matrix is passed to SGDRegressor, it doesn’t raise any error but sometimes gets a poor fit. I think it should either raise an error when the input is a sparse matrix or convert it into a dense matrix, like https://github.com/scikit-learn/scikit-learn/pull/535

Below is some code to compare the differences. (It fits well on the Boston housing prices dataset even when the input is a sparse matrix, but fails on the diabetes dataset.)

import numpy
from scipy.sparse import csr_matrix
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor, LinearRegression
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

diabetes = load_diabetes()
X = diabetes.data
# X = csr_matrix(X)  # Uncomment this line to use a sparse matrix
y = numpy.asarray(diabetes.target)

scaler = StandardScaler(with_mean=False)
scaler.fit(X)
X = scaler.transform(X)

estimator = SGDRegressor()
# estimator = LinearRegression()  # LinearRegression works with sparse matrix
estimator.fit(X, y)
predicted = estimator.predict(X)

fig, ax = plt.subplots()
ax.scatter(y, predicted, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

Just stating in the documentation that the input should be a dense matrix is not enough; it may still cause surprises like this.
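The dense-conversion option suggested above could look roughly like the sketch below. The helper name `as_dense` is hypothetical (it is not part of scikit-learn), and the approach assumes the data fits in memory once every zero is materialised.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse


def as_dense(X):
    """Return X as a dense ndarray, converting only if it is a SciPy sparse matrix."""
    # toarray() stores every zero explicitly, so this only suits data that fits in memory.
    return X.toarray() if issparse(X) else np.asarray(X)


# Tiny illustrative matrix (made up for this example).
X_sparse = csr_matrix(np.array([[0.0, 1.5], [2.0, 0.0]]))
X_dense = as_dense(X_sparse)
print(type(X_dense).__name__, X_dense.shape)
```

Calling `estimator.fit(as_dense(X), y)` in the script above would then take the dense code path regardless of how `X` was built.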

Versions: NumPy 1.14.2, SciPy 1.1.0, scikit-learn 0.19.1

Issue Analytics

Created 5 years ago
Comments: 24 (13 by maintainers)

Top GitHub Comments

jennaliu commented, Jun 1, 2018

http://scikit-learn.org/stable/modules/sgd.html#stochastic-gradient-descent-for-sparse-data

Note: The sparse implementation produces slightly different results than the dense implementation due to a shrunk learning rate for the intercept.

But to me, that difference is huge. In order to get the same fit, I need to run many more iterations.
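One way to check the effect described in this comment is to fit the same data in dense and CSR form for an increasing number of epochs and compare the intercept and the training R^2. This is only a sketch: the epoch counts and `random_state=0` are arbitrary choices, and `tol=None` is set so that each fit runs for exactly `max_iter` epochs.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler(with_mean=False).fit_transform(X)


def fit_summary(X, y, max_iter):
    """Fit SGDRegressor for a fixed number of epochs; return (intercept, training R^2)."""
    # tol=None disables early stopping, so exactly max_iter epochs are run.
    model = SGDRegressor(max_iter=max_iter, tol=None, random_state=0)
    model.fit(X, y)
    return float(model.intercept_[0]), float(model.score(X, y))


results = {}
for max_iter in (5, 50, 500):
    results[("dense", max_iter)] = fit_summary(X, y, max_iter)
    results[("sparse", max_iter)] = fit_summary(csr_matrix(X), y, max_iter)

for (kind, max_iter), (intercept, r2) in results.items():
    print(f"{kind:6s} max_iter={max_iter:3d} intercept={intercept:8.2f} R^2={r2:6.3f}")
```

If the shrunk intercept learning rate is the cause, the sparse intercept should lag behind the dense one at low epoch counts and close the gap as `max_iter` grows.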

SimonBenhamou commented, Jan 8, 2021

@jennaliu Thanks for pointing that out. After weeks spent debugging, I now understand that I am experiencing the same issue you described, for a classification problem with sparse CSR matrices. My intercept is way too low, which causes all kinds of issues. It must come from the intercept decay, whose purpose, like you, I don’t understand.

Is anyone still looking into this?

Thanks, Simon
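For the classification case, the dense and CSR intercepts can be compared directly. The sketch below also tries one possible workaround that is not proposed anywhere in this thread: appending a constant column and setting `fit_intercept=False`, so the bias is learned as an ordinary weight. The synthetic data, the epoch count, and the variable names are assumptions made for illustration, and the bias weight learned this way is subject to the regularisation penalty, unlike `intercept_`.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic, imbalanced data (an assumption for illustration) so the intercept matters.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_sparse = csr_matrix(X)

# Fixed number of epochs; tol=None disables early stopping.
params = dict(max_iter=20, tol=None, random_state=0)

dense = SGDClassifier(**params).fit(X, y)
sparse = SGDClassifier(**params).fit(X_sparse, y)

# Workaround sketch: learn the bias as an ordinary weight on a constant column.
# Caveat: this weight is penalised by the regulariser, unlike intercept_.
ones = csr_matrix(np.ones((X.shape[0], 1)))
X_aug = hstack([X_sparse, ones], format="csr")
augmented = SGDClassifier(fit_intercept=False, **params).fit(X_aug, y)

print("dense intercept:       ", dense.intercept_[0])
print("sparse intercept:      ", sparse.intercept_[0])
print("constant-column weight:", augmented.coef_[0, -1])
```

Whether the constant-column weight ends up closer to the dense intercept depends on the data and on the penalty strength, so the printed values should be inspected rather than assumed.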
