LogisticRegression with SAGA using sample_weight does not converge
Describe the bug
I am fitting a logistic regression model on a sparse matrix and a binary response. Many of the rows of the matrix are repeated, so to speed things up I switched to a smaller sparse matrix with non-repeated rows and passed the row repetition counts as the `sample_weight` argument to `fit`.
The issue is that when I fit the weighted model, `fit` produces a warning that it did not converge because it reached the maximum number of iterations.
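To make the setup concrete, here is a minimal sketch with synthetic data (illustrative only; all names here are made up, and the actual reproduction is in the notebook linked below):

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_unique = 100
X_unique = sparse.random(n_unique, 20, density=0.1, format="csr", random_state=0)
y_unique = rng.integers(0, 2, size=n_unique)
counts = rng.integers(1, 50, size=n_unique)  # how often each unique row repeats

# Fit on the full matrix with the rows physically repeated.
rows = np.repeat(np.arange(n_unique), counts)
LogisticRegression(solver="saga", penalty="l1").fit(X_unique[rows], y_unique[rows])

# Mathematically equivalent fit on the deduplicated matrix; this is the
# one that emits the ConvergenceWarning for me.
LogisticRegression(solver="saga", penalty="l1").fit(
    X_unique, y_unique, sample_weight=counts
)
```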
I looked a bit under the hood of `fit`: `sag_solver` does some scaling of `alpha` and `beta` using `n_samples`. The resulting `alpha_scaled` and `beta_scaled` are different between the weighted and unweighted cases, and they should not be (the loss function is the same). Perhaps the equivalent scaling for the weighted case should use the sum of the weights (if the intended 'unit' of the weights is 'count') rather than `n_samples`. I am not sure this is the actual bug, but it made me worry that the `sample_weight` argument is used somewhat naively, just as a multiplier on the loss function, with scaling implications that are not accounted for when deciding when to stop.
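To pin the scaling concern down, a toy sketch (my paraphrase of the behaviour, not actual scikit-learn code; the weighted branch is the suggestion above, not something that exists):

```python
import numpy as np

def scaled_alpha(alpha, n_samples, sample_weight=None):
    # Current behaviour (roughly): alpha is divided by the number of rows,
    # regardless of the weights attached to them.
    if sample_weight is None:
        return alpha / n_samples
    # Suggested for the weighted case, if the weights mean 'counts':
    return alpha / np.sum(sample_weight)

# 100 unique rows, each repeated 10 times, alpha = 1.0:
print(scaled_alpha(1.0, 1000))                     # 0.001  full matrix
print(scaled_alpha(1.0, 100))                      # 0.01   dedup, current scaling
print(scaled_alpha(1.0, 100, np.full(100, 10.0)))  # 0.001  dedup, suggested scaling
```

With the current scaling, deduplicating the rows shrinks `n_samples` while the effective dataset size (the sum of the weights) stays the same, so `alpha_scaled` changes even though the weighted loss is identical.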
UPDATE: It seems that the issue is specific to the SAGA solver. I tried liblinear and it seems to work. This solves my immediate problem, because for now I only need the L1 regularization. It is still worth looking into, though, because at the moment only SAGA offers elasticnet.
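For reference, the workaround looks like this (same illustrative names as the sketch above):

```python
# liblinear handles the weighted L1 fit without the convergence warning.
clf = LogisticRegression(penalty="l1", solver="liblinear")
clf.fit(X_unique, y_unique, sample_weight=counts)
```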
Steps/Code to Reproduce
vttrifonov/logistic_sample_weights.ipynb
Expected Results
In the above code I expect the second fit to run much faster than the first and to produce the same coefficients.
Actual Results
It fails on both counts.
Versions
```
System:
        python: 3.7.0 (default, Jun 28 2018, 07:39:16) [Clang 4.0.1 (tags/RELEASE_401/final)]
    executable: /Users/vtrifonov/projects/tiny-proteins/env/bin/python
       machine: Darwin-19.6.0-x86_64-i386-64bit

Python dependencies:
          pip: 21.2.2
   setuptools: 58.0.4
      sklearn: 0.24.2
        numpy: 1.20.3
        scipy: 1.7.1
       Cython: None
       pandas: 1.3.2
   matplotlib: 3.4.2
       joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True
```
Top GitHub Comments
I think the right fix is to sample taking the sample_weights into account, drawing sample j with probability w_j / sum_j w_j. It is, however, not that easy to do this very efficiently. Maybe you can find code for this in the original code of Mark Schmidt (ping @fabianp @RemiLeblond, thoughts?)
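For concreteness, the naive O(n)-per-draw version of that sampling step would be something like this (just to pin down the probabilities, not the efficient implementation being discussed):

```python
import numpy as np

def draw_sample(sample_weight, rng):
    # Draw sample j with probability w_j / sum_j w_j.
    p = sample_weight / sample_weight.sum()
    return rng.choice(len(sample_weight), p=p)

rng = np.random.default_rng(0)
w = np.array([1.0, 3.0, 6.0])
print([draw_sample(w, rng) for _ in range(5)])
```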
I remember Mark Schmidt saying that SAG is particularly fast with importance sampling based on the Lipschitz constant of each sample. In his implementation, he uses a binary tree to perform the importance sampling. His code is rather complicated by the fact that he updates the Lipschitz constant on the fly with a line search. If we use importance sampling with fixed sample weights, a simple binary tree will do. I am not sure there is a more efficient way.
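For fixed sample weights, the binary tree can be as simple as a Fenwick (binary indexed) tree over the weights: O(log n) per draw, and O(log n) per update if the weights ever change. A sketch (my illustration, not Mark Schmidt's code or anything in scikit-learn):

```python
import numpy as np

class FenwickSampler:
    """Draw index i with probability w[i] / sum(w) in O(log n)."""

    def __init__(self, weights):
        self.n = len(weights)
        self.w = np.zeros(self.n)
        self.tree = np.zeros(self.n + 1)  # 1-based Fenwick tree of partial sums
        self.total = 0.0
        for i, wi in enumerate(weights):
            self.set_weight(i, wi)

    def set_weight(self, i, new_w):
        """Change w[i] in O(log n)."""
        delta = new_w - self.w[i]
        self.w[i] = new_w
        self.total += delta
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & -j

    def sample(self, rng):
        """Draw an index with probability proportional to its weight."""
        u = rng.random() * self.total
        # Descend the implicit tree: find the largest position whose prefix
        # sum is <= u; the element just after it (0-based index idx) is drawn.
        idx, bit = 0, 1 << (self.n.bit_length() - 1)
        while bit:
            nxt = idx + bit
            if nxt <= self.n and self.tree[nxt] <= u:
                u -= self.tree[nxt]
                idx = nxt
            bit >>= 1
        return idx

rng = np.random.default_rng(0)
sampler = FenwickSampler([1.0, 3.0, 6.0])
draws = [sampler.sample(rng) for _ in range(10_000)]
print(np.bincount(draws) / 10_000)  # approximately [0.1, 0.3, 0.6]
```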