[RFC] Ridge correctness evaluation
I did some follow-up work on the potential correctness problems with `Ridge(normalize=False)` discussed in #19426 and #17444 in the following gist:
https://gist.github.com/ogrisel/cf7ac8bca6725ec02d4cc8d03ce096a7
This gist compares the results of a minimal L-BFGS-based ridge implementation by @agramfort against scikit-learn's `Ridge(normalize=False)` with `solver="cholesky"`, `solver="lsqr"`, `solver="svd"` and the other solvers.
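To illustrate the kind of comparison involved, here is a minimal sketch on synthetic data (this is not the gist's actual code, and it omits the `normalize` parameter, which recent scikit-learn versions have removed):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hedged sketch on synthetic data, not the gist's actual code.
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = rng.randn(200)

coefs = {}
for solver in ["cholesky", "lsqr", "svd"]:
    # Low tol so the iterative lsqr solver converges tightly;
    # cholesky and svd are direct methods and do not iterate.
    model = Ridge(alpha=1.0, solver=solver, tol=1e-12)
    model.fit(X, y)
    coefs[solver] = model.coef_

# The solvers should agree up to a loose tolerance (~1e-6, in line
# with the precision reported below).
for solver in ["cholesky", "lsqr"]:
    assert np.allclose(coefs[solver], coefs["svd"], atol=1e-6)
```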
The main conclusion is that those models seem to converge to the same solution, provided we set a low value of `tol` and are not too strict on the comparison tolerance: see how I defined `atol` in the gist. I cannot get convergence to machine precision (only around 1e-6/1e-7 in float64), but maybe this is expected. I am not so sure: I tried tweaking `pgtol` on top of `factr` but it did not improve things.
On large-ish problems, L-BFGS can be significantly faster than our always-accurate svd solver (but not always; it depends on the regularization), yet never faster than the lsqr solver (scipy's iterative LSQR). The cholesky solver (our default for dense data) is sometimes faster and sometimes slower than lsqr and L-BFGS (depending on regularization) but always faster than SVD (at least with high regularization). The sag/saga solvers are not necessarily as fast as they could be: I noticed that they do not benefit from multi-threading on a multicore machine, which might be the reason. I have never seen them run faster than lsqr. sag/saga can be significantly faster than L-BFGS, despite using fewer cores, on problems with n_samples=5e3 / n_features=1e3 and not too much regularization (alpha=1e-6).
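These timing observations are machine and workload dependent; a rough sketch of how such a comparison could be reproduced (synthetic data, with shapes scaled down from those mentioned above so it runs quickly) is:

```python
import time

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
# Shapes scaled down from the n_samples=5e3 / n_features=1e3 case above.
X = rng.randn(2000, 200)
y = rng.randn(2000)

timings = {}
for solver in ["cholesky", "lsqr", "svd"]:
    model = Ridge(alpha=1e-6, solver=solver)
    tic = time.perf_counter()
    model.fit(X, y)
    timings[solver] = time.perf_counter() - tic

# Print solvers from fastest to slowest; the actual ordering depends
# on the machine, the problem shape and the regularization strength.
for solver, duration in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{solver:>8}: {duration:.4f}s")
```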
There are still problems with `normalize=True` as reported in #17444, but maybe we don't care so much because we are going to deprecate this option; see #17772 by @maikia, which I plan to review soon.
Proposed plan of actions:

- proceed with the deprecation of `normalize=True` in #17772 and <del>therefore put #17444 on hold (or less of a priority)</del> actually no, see: #19616.
- improve the ridge tests in scikit-learn by explicitly evaluating the minimized function for the different solvers, maybe using a utility such as the `check_min` of my gist. Tight tests (but without checking the minimum explicitly): #22910.
- maybe add the `ridge_bfgs` function as a reference to compare against in our test suite? Since it does not seem to always converge to a very high quality optimum, I am not so sure of the value.
- maybe we could conduct a more systematic evaluation on various datasets with different shapes, conditioning and regularization to select a better default than “cholesky” for dense data? “cholesky” is reputed to be numerically unstable, although I did not observe problems in my gist, maybe because of the regularization. `lsqr` seems very promising, both in terms of speed and accuracy of the solution (according to my `check_min` thingy).
- I am not sure that our default value of `tol=1e-3` is safe. It is probably very solver dependent, but I doubt that it is a good idea for the fast `lsqr` solver, for instance (to be benchmarked if we do the above evaluation). https://github.com/scikit-learn/scikit-learn/pull/24465
Other ideas:

- maybe we should consider adding l-bfgs as an option? That would require more evaluation on sparse data / ill-conditioned data with various values of alpha. I am not sure of the benefit compared to `lsqr` or `cholesky` though: they seem to always be more accurate than l-bfgs and faster (especially `lsqr`). And l-bfgs can be very slow when alpha is small.
- and maybe add a QR-based solver that is supposed to be more numerically stable? Or is this what LAPACK does internally when we use the `lsqr` solver?
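To make the augmented-system idea concrete, here is a sketch that rewrites ridge as an ordinary least-squares problem on a stacked matrix and solves it with scipy's LAPACK-backed `lstsq` (an illustration of the equivalence, not an existing scikit-learn solver):

```python
import numpy as np
from scipy.linalg import lstsq
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X, y = rng.randn(100, 10), rng.randn(100)
alpha = 1.0

# Ridge as an ordinary least-squares problem on an augmented system:
#   min_w || [X; sqrt(alpha) * I] @ w - [y; 0] ||^2
# which expands to ||X @ w - y||^2 + alpha * ||w||^2.
X_aug = np.vstack([X, np.sqrt(alpha) * np.eye(X.shape[1])])
y_aug = np.concatenate([y, np.zeros(X.shape[1])])
w_aug, *_ = lstsq(X_aug, y_aug)  # LAPACK least-squares driver

w_ridge = Ridge(alpha=alpha, solver="svd",
                fit_intercept=False).fit(X, y).coef_
assert np.allclose(w_aug, w_ridge, atol=1e-10)
```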
Issue Analytics

- Created: 3 years ago
- Comments: 11 (11 by maintainers)
Top GitHub Comments
> @ogrisel Which benchmark are you referring to?

Sorry, that suggestion of mine was typed too quickly and too late in the evening. I mixed up `_solve_sparse_cg` with CG-based optimization routines. However, https://web.stanford.edu/group/SOL/software/lsqr/ states: […]

On the corresponding LSMR page: […]
I’m also on board with the proposed plan of actions and, like @agramfort, don’t think that adding lbfgs is necessary. A few further comments:

- `tol=1e-3` is indeed a bit large. On top of that, it means different things for different solvers.
- Several solvers work on the normal equations `X.T @ X + penalty`, which squares the condition number of X and directly impacts the precision of the solvers. The penalty should help, indeed.
- Solving `[X, lambda*Identity].T @ coef = [y, 0].T` in some way could be interesting, as it avoids squaring the condition number.
- […] `fit_intercept`. If that were the case already, I could imagine removing “sparse_cg”.
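The conditioning point about the normal equations can be checked numerically: for any X, `cond(X.T @ X)` equals `cond(X)**2`, which is the conditioning that normal-equations-based solvers actually face. A small sketch on synthetic, deliberately ill-conditioned data:

```python
import numpy as np

rng = np.random.RandomState(0)
# Ill-conditioned design: column scales spread over 5 orders of magnitude.
X = rng.randn(100, 10) * np.logspace(0, -5, 10)

c = np.linalg.cond(X)
c_normal = np.linalg.cond(X.T @ X)  # equals cond(X)**2 up to rounding
print(f"cond(X) = {c:.2e}, cond(X.T @ X) = {c_normal:.2e}")
assert c_normal > c ** 1.8  # roughly the square of cond(X)
```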