SGDRegressor gets poor fit with sparse matrix

If a sparse matrix is passed to SGDRegressor.fit, it doesn’t throw any error, but the fit is sometimes poor. I think it should either throw an error when the input is a sparse matrix or convert it to a dense matrix, as in https://github.com/scikit-learn/scikit-learn/pull/535

Below is some code to compare the two cases. (It fits well on the Boston housing price dataset even when the input is a sparse matrix, but fails on the diabetes dataset.)

import numpy
from scipy.sparse import csr_matrix
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor, LinearRegression
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

diabetes = load_diabetes()
X = diabetes.data
# X = csr_matrix(X)  # Uncomment this line to use a sparse matrix
y = numpy.asarray(diabetes.target)

scaler = StandardScaler(with_mean=False)
scaler.fit(X)
X = scaler.transform(X)

estimator = SGDRegressor()
# estimator = LinearRegression()  # LinearRegression works with sparse matrix
estimator.fit(X, y)
predicted = estimator.predict(X)

fig, ax = plt.subplots()
ax.scatter(y, predicted, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

Just stating in the documentation that the input should be a dense matrix is not enough; it may cause surprises like this.

Versions: NumPy 1.14.2, SciPy 1.1.0, Scikit-Learn 0.19.1
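The visual comparison in the script above can also be made numerically. The following is a minimal sketch, not part of the original report: it fits the same data as a dense array, as a CSR matrix, and as a CSR matrix converted back to dense (the conversion the issue proposes), then prints the training R^2 and the fitted intercept for a few epoch counts. The helper name `fit_and_report`, the epoch counts, and `random_state=0` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler


def fit_and_report(X, y, max_iter):
    """Hypothetical helper: fit SGDRegressor, return (training R^2, intercept)."""
    # tol=None disables early stopping, so exactly max_iter epochs are run.
    estimator = SGDRegressor(max_iter=max_iter, tol=None, random_state=0)
    estimator.fit(X, y)
    return estimator.score(X, y), float(estimator.intercept_[0])


diabetes = load_diabetes()
y = np.asarray(diabetes.target)
X_dense = StandardScaler(with_mean=False).fit_transform(diabetes.data)
X_sparse = csr_matrix(X_dense)

results = {}
for max_iter in (5, 50, 500):  # illustrative epoch counts
    results[max_iter] = {
        "dense": fit_and_report(X_dense, y, max_iter),
        "sparse": fit_and_report(X_sparse, y, max_iter),
        # Conversion proposed in the issue: densify before fitting.
        "densified": fit_and_report(X_sparse.toarray(), y, max_iter),
    }

for max_iter, row in results.items():
    for kind, (r2, intercept) in row.items():
        print(f"max_iter={max_iter:4d} {kind:9s} R^2={r2:.3f} intercept={intercept:.1f}")
```

Densifying is only practical when the matrix fits in memory, so this is a diagnostic rather than a general fix.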

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 24 (13 by maintainers)

Top GitHub Comments

1 reaction
jennaliu commented, Jun 1, 2018

http://scikit-learn.org/stable/modules/sgd.html#stochastic-gradient-descent-for-sparse-data

Note: The sparse implementation produces slightly different results than the dense implementation due to a shrunk learning rate for the intercept.

But to me, that difference is huge. In order to get the same fit, I need to run many more iterations.
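To illustrate why a shrunk intercept learning rate can require many more iterations, here is a toy sketch in plain NumPy. It is not scikit-learn's implementation: the function `sgd_linear`, the synthetic data, the constant step size, and the decay factor of 0.01 are all assumptions made for illustration. The only difference between the two runs is the factor applied to the intercept update.

```python
import numpy as np


def sgd_linear(X, y, intercept_decay, n_epochs=20, eta=0.01, seed=0):
    """Plain SGD for least squares; the intercept step is scaled by intercept_decay."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(n_samples):
            error = X[i] @ w + b - y[i]
            w -= eta * error * X[i]
            # The intercept gets a smaller step when intercept_decay < 1.
            b -= eta * intercept_decay * error
    return w, b


# Synthetic data with a large offset in the target (assumed values).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
true_b = 150.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

_, b_full = sgd_linear(X, y, intercept_decay=1.0)
_, b_decayed = sgd_linear(X, y, intercept_decay=0.01)  # assumed decay factor

print(f"true intercept:              {true_b:.1f}")
print(f"intercept, full step:        {b_full:.1f}")
print(f"intercept, decayed step:     {b_decayed:.1f}")
```

In this toy setup, the run with the decayed step should end well short of the true intercept after the same number of epochs, which matches the observation that the sparse path needs many more iterations when the target has a large offset.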

0 reactions
SimonBenhamou commented, Jan 8, 2021

@jennaliu Thanks for pointing that out. After weeks spent debugging, I now understand that I am experiencing the same issue you described, for a classification problem with sparse CSR matrices. My intercept is way too low, which causes all kinds of issues. It must come from the intercept decay, whose purpose, like you, I don’t understand.

Is anyone still looking into this?

Thanks, Simon

Top Results From Across the Web

[Scikit-learn-general] SGDRegressor for sparse matrix
Hi everyone, I am trying to use the SGDRegressor to solve for a sparse set of linear equations. I am getting an under/over-flow...

python - ScikitLearn regression: Design matrix X too big for ...
SGDRegressor , allowing to fit only a mini-batch. It is what you are looking for. The process is: use the generator to yield...

sklearn.linear_model.SGDRegressor
Fit linear model with Stochastic Gradient Descent. Parameters: X{array-like, sparse matrix}, shape (n_samples, n_features). Training ...

8.15.2.3. sklearn.linear_model.sparse.SGDClassifier
Fit linear model with Stochastic Gradient Descent. fit_transform(X[, y]), Fit to data, then transform it. get_params([deep]), Get parameters for the ...

Machine Learning & Deep Learning Guide | Analytics Vidhya
Supervised learning — Stochastic Gradient Descent (SGD) Regressor: ... We applied the fit only on the training set and not the test set....
