Scaling issues in l-bfgs for LogisticRegression

It looks like L-BFGS is very sensitive to the scaling of the data, which can lead to convergence issues. I feel like we might be able to fix this by changing how the optimization problem is framed.

Example:

import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale

data = fetch_openml(data_id=1590, as_frame=True)
cross_val_score(LogisticRegression(), pd.get_dummies(data.data), data.target)

This gives convergence warnings; after scaling, it doesn't. I have seen this in many places. While people should scale, I don't think a warning about the number of iterations is a good thing to show to the user. If we can fix this, I think we should.

Using the bank campaign data I got coefficients that were quite different if I increased the number of iterations (I got convergence warnings with the default of 100). If I scaled the data, that issue went away.
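To make the effect easy to reproduce without downloading anything, here is a minimal sketch on synthetic data. The dataset, the column stretch factors, and the helper name `lbfgs_iterations` are all assumptions chosen for illustration; they are not from the issue. It compares how many L-BFGS iterations are used with and without a `StandardScaler` in front of the model.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset: each column is stretched by a very
# different factor (an assumption made only to provoke the scaling problem).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_badly_scaled = X * np.logspace(-2, 4, X.shape[1])


def lbfgs_iterations(model, X, y):
    """Fit `model` and return the number of iterations the solver reported."""
    with warnings.catch_warnings():
        # Silence the warning here; we inspect n_iter_ instead.
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        model.fit(X, y)
    # A Pipeline exposes its last step via indexing; a bare estimator is used as is.
    final = model[-1] if hasattr(model, "steps") else model
    return int(final.n_iter_[0])


raw_iters = lbfgs_iterations(
    LogisticRegression(solver="lbfgs", max_iter=100), X_badly_scaled, y
)
scaled_iters = lbfgs_iterations(
    make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs", max_iter=100)),
    X_badly_scaled,
    y,
)
print("iterations without scaling:", raw_iters)
print("iterations with StandardScaler:", scaled_iters)
```

If the unscaled run reports `max_iter` iterations, the solver stopped before converging, which is the situation the warning describes. Note that scaling inside a pipeline solves a differently regularized problem than the unscaled one, so the coefficients are not directly comparable.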

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 4
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

1 reaction
amueller commented, Dec 27, 2019

The fun never ends. Here’s a toy example from my book where liblinear is worse and gives qualitatively different results?!

import numpy as np
from sklearn.datasets import make_blobs

def make_forge():
    # a carefully hand-designed dataset lol
    X, y = make_blobs(centers=2, random_state=4, n_samples=30)
    y[np.array([7, 27])] = 0
    mask = np.ones(len(X), dtype=bool)
    mask[np.array([0, 1, 5, 26])] = 0
    X, y = X[mask], y[mask]
    return X, y

from sklearn.linear_model import LogisticRegression
X, y = make_forge()
lr1 = LogisticRegression(solver='liblinear').fit(X, y)
lr2 = LogisticRegression(solver='lbfgs').fit(X, y)

from sklearn.linear_model._logistic import _logistic_loss

print(_logistic_loss(np.hstack([lr1.coef_.ravel(), lr1.intercept_]), X, 2 * y - 1, 1))  # 7.283720321745406
print(_logistic_loss(np.hstack([lr2.coef_.ravel(), lr2.intercept_]), X, 2 * y - 1, 1))  # 5.501204675670945
print(lr1.coef_)  # [[-0.36631767  1.25909579]]
print(lr2.coef_)  # [[0.67289534 1.53136443]]

See https://github.com/amueller/introduction_to_ml_with_python/issues/124
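The comparison above relies on `_logistic_loss`, a private scikit-learn helper that may not be importable in every release. As a hedged alternative, the same kind of comparison can be sketched with plain NumPy. The function `l2_logistic_objective` below is a hypothetical name, and it assumes the objective is 0.5 * ||w||^2 + C * sum of logistic losses with an unpenalized intercept, which is my reading of what the helper computes when its last argument is 1.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression


def make_forge():
    # Same construction as in the comment above.
    X, y = make_blobs(centers=2, random_state=4, n_samples=30)
    y[np.array([7, 27])] = 0
    mask = np.ones(len(X), dtype=bool)
    mask[np.array([0, 1, 5, 26])] = 0
    return X[mask], y[mask]


def l2_logistic_objective(coef, intercept, X, y01, C=1.0):
    """0.5 * ||coef||^2 + C * sum(log(1 + exp(-y * (X @ coef + intercept)))).

    The intercept is not penalized. Labels in {0, 1} are mapped to {-1, +1}.
    """
    y_pm = 2 * y01 - 1
    margins = y_pm * (X @ coef + intercept)
    # logaddexp(0, -m) is a numerically stable log(1 + exp(-m)).
    return 0.5 * coef @ coef + C * np.logaddexp(0.0, -margins).sum()


X, y = make_forge()
losses = {}
for solver in ("liblinear", "lbfgs"):
    model = LogisticRegression(solver=solver).fit(X, y)
    losses[solver] = l2_logistic_objective(
        model.coef_.ravel(), model.intercept_[0], X, y
    )
    print(solver, losses[solver])
```

One documented difference that may contribute to the gap: liblinear treats the intercept as an extra feature (see `intercept_scaling`), so it is regularized along with the coefficients, whereas lbfgs leaves it unpenalized. Whether that fully explains the numbers in this comment is not established here.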

1 reaction
GaelVaroquaux commented, Nov 9, 2019

I guess the standard line of sklearn would be to do exactly what the user told us to, so we should use a diagonal preconditioner.

+1: diagonal preconditioner, as we try to solve the canonical problem.

Good thinking @amueller, this will be useful!
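For readers unfamiliar with the idea, here is a rough sketch of what a diagonal preconditioner could look like. This is not scikit-learn's implementation; the function name `fit_logreg_preconditioned` and every design choice in it are assumptions for illustration. The optimizer works on standardized variables v = s * w (with s the per-feature standard deviation), but the penalty is still written in terms of the original w, so the solution is that of the original, unscaled problem.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def fit_logreg_preconditioned(X, y01, C=1.0):
    """Minimize 0.5*||w||^2 + C*sum(log(1+exp(-y*(X@w+b)))) via v = s * w."""
    y = 2 * y01 - 1                      # labels in {-1, +1}
    d = X.shape[1]
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    s[s == 0] = 1.0                      # guard against constant columns
    Z = (X - mu) / s                     # standardized design matrix

    def objective(theta):
        v, c = theta[:d], theta[d]
        w = v / s                        # coefficients in original units
        margins = y * (Z @ v + c)
        value = 0.5 * w @ w + C * np.logaddexp(0.0, -margins).sum()
        # d/dm log(1 + exp(-m)) = -expit(-m)
        g_margin = -C * y * expit(-margins)
        grad_v = Z.T @ g_margin + w / s  # chain rule through w = v / s
        grad_c = g_margin.sum()
        return value, np.append(grad_v, grad_c)

    result = minimize(
        objective,
        np.zeros(d + 1),
        jac=True,
        method="L-BFGS-B",
        options={"gtol": 1e-8, "ftol": 1e-12, "maxiter": 1000},
    )
    w = result.x[:d] / s
    b = result.x[d] - mu @ w             # undo the centering
    return w, b


# Check against scikit-learn on data where its own lbfgs converges easily.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
w, b = fit_logreg_preconditioned(X_demo, y_demo, C=1.0)
reference = LogisticRegression(C=1.0, solver="lbfgs", tol=1e-10, max_iter=10_000)
reference.fit(X_demo, y_demo)
max_abs_diff = float(np.max(np.abs(w - reference.coef_.ravel())))
print("largest coefficient difference:", max_abs_diff)
```

The design choice worth noting is that only the parametrization changes, not the objective: the user still gets the fit for the data as given, which is the behaviour the comment above asks for.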


