LogisticRegression with SAGA using sample_weight does not converge
Describe the bug
I am fitting a logistic regression model on a sparse matrix and a binary response. Many of the rows of the matrix are repeated, so to speed things up I switched to a smaller sparse matrix with non-repeated rows and passed the row repetition counts as the `sample_weight` argument to `fit`.
The issue is that when I fit the weighted model, `fit` produces a warning that it did not converge because it reached the maximum number of iterations.
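To make the setup concrete, here is a minimal sketch with synthetic data (illustrative only; all names here are made up, and the actual reproduction is in the notebook linked below):

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_unique = 100
X_unique = sparse.random(n_unique, 20, density=0.1, format="csr", random_state=0)
y_unique = rng.integers(0, 2, size=n_unique)
counts = rng.integers(1, 50, size=n_unique)  # how often each unique row repeats

# Fit on the full matrix with the rows physically repeated.
rows = np.repeat(np.arange(n_unique), counts)
LogisticRegression(solver="saga", penalty="l1").fit(X_unique[rows], y_unique[rows])

# Mathematically equivalent fit on the deduplicated matrix; this is the
# one that emits the ConvergenceWarning for me.
LogisticRegression(solver="saga", penalty="l1").fit(
    X_unique, y_unique, sample_weight=counts
)
```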
I looked a bit under the hood of `fit`: `sag_solver` does some scaling of `alpha` and `beta` using `n_samples`. The resulting `alpha_scaled` and `beta_scaled` are different between the weighted and unweighted cases, and they should not be (the loss function is the same). Perhaps the equivalent scaling for the weighted case should use the sum of the weights (if the intended 'unit' of the weights is 'count') rather than `n_samples`. I am not sure this is the actual bug, but it made me worry that the `sample_weight` argument is used somewhat naively, just as a multiplier on the loss function, with scaling implications that are not accounted for when deciding when to stop.
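To pin the scaling concern down, a toy sketch (my paraphrase of the behaviour, not actual scikit-learn code; the weighted branch is the suggestion above, not something that exists):

```python
import numpy as np

def scaled_alpha(alpha, n_samples, sample_weight=None):
    # Current behaviour (roughly): alpha is divided by the number of rows,
    # regardless of the weights attached to them.
    if sample_weight is None:
        return alpha / n_samples
    # Suggested for the weighted case, if the weights mean 'counts':
    return alpha / np.sum(sample_weight)

# 100 unique rows, each repeated 10 times, alpha = 1.0:
print(scaled_alpha(1.0, 1000))                     # 0.001  full matrix
print(scaled_alpha(1.0, 100))                      # 0.01   dedup, current scaling
print(scaled_alpha(1.0, 100, np.full(100, 10.0)))  # 0.001  dedup, suggested scaling
```

With the current scaling, deduplicating the rows shrinks `n_samples` while the effective dataset size (the sum of the weights) stays the same, so `alpha_scaled` changes even though the weighted loss is identical.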
UPDATE: It seems that the issue is specific to the SAGA solver. I tried liblinear and it seems to work. This solves my immediate problem, because for now I only need the L1 regularization. It is still worth looking into, though, because at the moment only SAGA offers elasticnet.
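For reference, the workaround looks like this (same illustrative names as the sketch above):

```python
# liblinear handles the weighted L1 fit without the convergence warning.
clf = LogisticRegression(penalty="l1", solver="liblinear")
clf.fit(X_unique, y_unique, sample_weight=counts)
```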
Steps/Code to Reproduce
vttrifonov/logistic_sample_weights.ipynb
Expected Results
In the above code I expect the second fit to run much faster than the first and to produce the same coefficients.
Actual Results
It fails on both counts.
Versions
```
System:
        python: 3.7.0 (default, Jun 28 2018, 07:39:16) [Clang 4.0.1 (tags/RELEASE_401/final)]
    executable: /Users/vtrifonov/projects/tiny-proteins/env/bin/python
       machine: Darwin-19.6.0-x86_64-i386-64bit

Python dependencies:
          pip: 21.2.2
   setuptools: 58.0.4
      sklearn: 0.24.2
        numpy: 1.20.3
        scipy: 1.7.1
       Cython: None
       pandas: 1.3.2
   matplotlib: 3.4.2
       joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True
```
Top GitHub Comments
I think the right fix is to sample taking the sample_weights into account, drawing sample j with probability w_j / sum_j w_j. It is, however, not that easy to do this very efficiently. Maybe you can find code for this in the original code of Mark Schmidt (ping @fabianp @RemiLeblond, thoughts?)
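For concreteness, the naive O(n)-per-draw version of that sampling step would be something like this (just to pin down the probabilities, not the efficient implementation being discussed):

```python
import numpy as np

def draw_sample(sample_weight, rng):
    # Draw sample j with probability w_j / sum_j w_j.
    p = sample_weight / sample_weight.sum()
    return rng.choice(len(sample_weight), p=p)

rng = np.random.default_rng(0)
w = np.array([1.0, 3.0, 6.0])
print([draw_sample(w, rng) for _ in range(5)])
```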
I remember Mark Schmidt saying that SAG is particularly fast with importance sampling based on the Lipschitz constant of each sample. In his implementation, he uses a binary tree to perform the importance sampling. His code is rather complicated by the fact that he updates the Lipschitz constant on the fly with a line search. If we use importance sampling with fixed sample weights, a simple binary tree will do. I am not sure there is a more efficient way.
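For fixed sample weights, the binary tree can be as simple as a Fenwick (binary indexed) tree over the weights: O(log n) per draw, and O(log n) per update if the weights ever change. A sketch (my illustration, not Mark Schmidt's code or anything in scikit-learn):

```python
import numpy as np

class FenwickSampler:
    """Draw index i with probability w[i] / sum(w) in O(log n)."""

    def __init__(self, weights):
        self.n = len(weights)
        self.w = np.zeros(self.n)
        self.tree = np.zeros(self.n + 1)  # 1-based Fenwick tree of partial sums
        self.total = 0.0
        for i, wi in enumerate(weights):
            self.set_weight(i, wi)

    def set_weight(self, i, new_w):
        """Change w[i] in O(log n)."""
        delta = new_w - self.w[i]
        self.w[i] = new_w
        self.total += delta
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & -j

    def sample(self, rng):
        """Draw an index with probability proportional to its weight."""
        u = rng.random() * self.total
        # Descend the implicit tree: find the largest position whose prefix
        # sum is <= u; the element just after it (0-based index idx) is drawn.
        idx, bit = 0, 1 << (self.n.bit_length() - 1)
        while bit:
            nxt = idx + bit
            if nxt <= self.n and self.tree[nxt] <= u:
                u -= self.tree[nxt]
                idx = nxt
            bit >>= 1
        return idx

rng = np.random.default_rng(0)
sampler = FenwickSampler([1.0, 3.0, 6.0])
draws = [sampler.sample(rng) for _ in range(10_000)]
print(np.bincount(draws) / 10_000)  # approximately [0.1, 0.3, 0.6]
```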