
[RFC] Ridge correctness evaluation

See original GitHub issue

I did some follow-up work on the discussions of potential problems with Ridge(normalize=False) in #19426 and #17444, in the following gist:

https://gist.github.com/ogrisel/cf7ac8bca6725ec02d4cc8d03ce096a7

This gist compares a minimal LBFGS-based implementation of ridge by @agramfort against scikit-learn’s Ridge(normalize=False, solver="cholesky"), Ridge(normalize=False, solver="lsqr"), Ridge(normalize=False, solver="svd") and other solvers.

The main conclusion is that those models seem to converge to the same solution, provided we set a low value of tol for the solvers and are not too strict on the comparison tolerance: see how I defined atol in the gist. I cannot get convergence to machine precision (only around 1e-6/1e-7 in float64), but maybe this is expected. I am not so sure: I tried to tweak pgtol on top of factr but it did not help.
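As a rough illustration of that kind of cross-solver comparison, here is a minimal sketch (not the gist itself; the dataset, tolerances and the atol definition below are assumptions, not necessarily the ones used in the gist) that fits Ridge with several solvers on the same data and checks that the coefficients agree up to a scale-aware atol:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=1000, n_features=50, noise=1.0, random_state=0)
alpha = 1.0

coefs = {}
for solver in ["cholesky", "lsqr", "svd", "sparse_cg", "sag", "saga"]:
    # Low solver tol so that iterative solvers converge tightly.
    model = Ridge(alpha=alpha, solver=solver, tol=1e-10, max_iter=100_000)
    coefs[solver] = model.fit(X, y).coef_

ref = coefs["svd"]  # treat the SVD solution as the reference
atol = 1e-6 * np.linalg.norm(ref)  # loose, scale-aware comparison tolerance (assumption)
for solver, coef in coefs.items():
    assert np.allclose(coef, ref, atol=atol), f"{solver} disagrees with svd"
```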

On large-ish problems, L-BFGS can be significantly faster than our always accurate svd solver (but not always, it depends on the regularization), yet never faster than the lsqr (LAPACK-based) solver. The cholesky solver (our default for dense data) is sometimes faster or slower than lsqr and L-BFGS (depending on regularization) but always faster than SVD (at least with high regularization). The sag/saga solvers are not necessarily as fast as they could be: I noticed that they do not benefit from multi-threading on a multicore machine, so that might be the reason. I have never seen them run faster than lsqr. sag/saga can be significantly faster than L-BFGS, despite using fewer cores, on problems with n_samples=5e3 / n_features=1e3 and not too much regularization (alpha=1e-6).
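For reference, a rough timing sketch along those lines could look as follows (the problem shape matches the one mentioned above, but the solver list, tolerance and alpha are assumptions, and absolute timings depend heavily on the machine and BLAS):

```python
import time

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Problem shaped like the one discussed above: n_samples=5e3, n_features=1e3.
X, y = make_regression(n_samples=5000, n_features=1000, random_state=0)

for solver in ["svd", "cholesky", "lsqr", "sag", "saga"]:
    # Note: sag/saga can be slow with small alpha and tight tol.
    model = Ridge(alpha=1e-6, solver=solver, tol=1e-6, max_iter=10_000)
    tic = time.perf_counter()
    model.fit(X, y)
    print(f"{solver:>8s}: {time.perf_counter() - tic:.3f} s")
```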

There are still problems with normalize=True as reported in #17444, but maybe we don’t care so much because we are going to deprecate this option; see #17772 by @maikia, which I plan to review soon.

Proposed plan of actions:

  • proceed with the deprecation of normalize=True in #17772 and ~~therefore put #17444 on hold (or make it less of a priority)~~ actually no, see #19616.
  • improve the ridge tests in scikit-learn by explicitly evaluating the minimized function for different solvers, maybe using a utility such as the check_min of my gist (a sketch of such a check follows this list). Tight tests (but without checking the minimum explicitly): #22910
  • maybe add the ridge_bfgs function as a reference to compare against in our test suite? Since it does not seem to always converge to a very high quality optimum, I am not so sure of the value.
  • maybe we could conduct a more systematic evaluation on various datasets with different shapes, conditioning and regularization to select a better default than “cholesky” for dense data? “cholesky” is reputed not to be numerically stable, although I did not observe problems in my gist, maybe because of the regularization. lsqr seems very promising, both in terms of speed and accuracy of the solution (according to my check_min thingy).
  • I am not sure that our default value of tol=1e-3 is safe. It’s probably very solver-dependent, but I doubt that it’s a good idea for the fast lsqr solver, for instance (to be benchmarked if we do the above evaluation). https://github.com/scikit-learn/scikit-learn/pull/24465
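A hedged sketch of what such a check_min-style test utility could look like (the actual check_min in the gist may differ; the dataset, solver list and tolerances below are assumptions): fit Ridge with several solvers, evaluate the objective that Ridge minimizes at each solution, and require every solver to reach a near-optimal objective value.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge


def ridge_objective(X, y, w, alpha):
    # Objective minimized by Ridge (without intercept): ||y - Xw||^2 + alpha * ||w||^2
    residual = y - X @ w
    return residual @ residual + alpha * (w @ w)


X, y = make_regression(n_samples=500, n_features=100, noise=1.0, random_state=0)
alpha = 1.0

objectives = {}
for solver in ["cholesky", "lsqr", "svd", "sparse_cg"]:
    model = Ridge(alpha=alpha, solver=solver, fit_intercept=False, tol=1e-12)
    objectives[solver] = ridge_objective(X, y, model.fit(X, y).coef_, alpha)

best = min(objectives.values())
for solver, obj in objectives.items():
    # Loose relative tolerance: we only require a near-optimal objective value,
    # not agreement to machine precision.
    assert obj <= best * (1 + 1e-7), f"{solver} is not close to the minimum"
```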

Other ideas:

  • maybe we should consider adding l-bfgs as an option? That would require doing more evaluation on sparse data / ill-conditioned data with various values of alpha (a minimal l-bfgs ridge sketch follows this list). Not sure of the benefit compared to lsqr or cholesky though: they seem to always be more accurate than l-bfgs and faster (especially lsqr). And l-bfgs can be very slow when alpha is small.
  • and maybe add a QR-based solver that is supposed to be more numerically stable? Or is this what LAPACK is doing internally when we use the lsqr solver?
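A minimal l-bfgs ridge could look like the following sketch (an assumption of what such a solver might do, not the ridge_bfgs function from the gist; the dataset and tolerances are made up as well):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge


def ridge_lbfgs(X, y, alpha, tol=1e-12):
    # Minimize ||y - Xw||^2 + alpha * ||w||^2 with L-BFGS-B and an analytic gradient.
    def objective_and_grad(w):
        residual = y - X @ w
        obj = residual @ residual + alpha * (w @ w)
        grad = -2 * X.T @ residual + 2 * alpha * w
        return obj, grad

    w0 = np.zeros(X.shape[1])
    result = minimize(objective_and_grad, w0, jac=True, method="L-BFGS-B",
                      options={"ftol": tol, "gtol": tol, "maxiter": 10_000})
    return result.x


X, y = make_regression(n_samples=1000, n_features=200, noise=1.0, random_state=0)
alpha = 1.0
coef_lbfgs = ridge_lbfgs(X, y, alpha)
coef_lsqr = Ridge(alpha=alpha, solver="lsqr", fit_intercept=False, tol=1e-12).fit(X, y).coef_
# Agreement is typically good but not at machine precision, as discussed above.
print(np.max(np.abs(coef_lbfgs - coef_lsqr)))
```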

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
lorentzenchr commented, Mar 12, 2021

@ogrisel Which benchmark are you referring to?

> remove sparse_cg? I would think it’s the right thing to do for sparse X, no?

Sorry, that suggestion of mine was typed too quickly and too late in the evening. I mixed up _solve_sparse_cg with CG-based optimization routines. However, https://web.stanford.edu/group/SOL/software/lsqr/ states:

It [LSQR --Ed.] is algebraically equivalent to applying CG to the normal equation \((A^T A + \lambda^2 I) x = A^T b\), but has better numerical properties, especially if \(A\) is ill-conditioned.

On the corresponding LSMR page:

Special feature: Both \(\|r\|\) and \(\|A^T r\|\) decrease monotonically, where \(r = b - Ax\) is the current residual. For LSQR, only \(\|r\|\) is monotonic. LSQR is recommended for compatible systems \(Ax = b\), but on least-squares problems with loose stopping tolerances, LSMR may be able to terminate significantly sooner than LSQR.
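To make the quoted relation concrete in the ridge setting, here is a sketch (assumptions about the dataset and tolerances, and no claim about scikit-learn internals) showing that scipy's LSQR with damp=sqrt(alpha) minimizes the same damped least-squares objective as Ridge, so the two solutions should agree closely:

```python
import numpy as np
from scipy.sparse.linalg import lsqr
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=2000, n_features=100, noise=1.0, random_state=0)
alpha = 1.0

# LSQR with damp=d minimizes ||Xw - y||^2 + d^2 * ||w||^2, so d = sqrt(alpha)
# corresponds to the ridge penalty alpha * ||w||^2.
coef_lsqr = lsqr(X, y, damp=np.sqrt(alpha), atol=1e-12, btol=1e-12)[0]

coef_ridge = Ridge(alpha=alpha, solver="lsqr", fit_intercept=False, tol=1e-12).fit(X, y).coef_
print(np.max(np.abs(coef_lsqr - coef_ridge)))  # expected to be tiny
```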

1 reaction
lorentzenchr commented, Mar 11, 2021

I’m also on board with the Proposed plan of actions, and, like @agramfort, don’t think that adding lbfgs is necessary. A few further comments:

  • tol=1e-3 is indeed a bit large. On top of that, it means different things for different solvers.
  • “cholesky” solves the normal equations, i.e. computes X.T @ X + penalty, and therefore squares the condition number of X, which directly impacts the precision of the solvers. The penalty should help, indeed.
  • You propose a QR-based solution. I could imagine that solving the stacked system [X; sqrt(lambda) * Identity] @ coef = [y; 0] in the least-squares sense could be interesting, as it avoids squaring the condition number (see the sketch after this list).
  • lsmr is competing with lsqr, but in the past I could not get a clear picture of which one to prefer in general. Both first use a Golub-Kahan bidiagonalization and then Givens rotations (leading to a QR decomposition); solving triangular linear systems is more or less all that is done on top.
  • lsqr should also work with sparse input and fit_intercept. If that were already the case, I could imagine removing “sparse_cg”.
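As an illustration of the two approaches above (a sketch under assumptions, not how scikit-learn implements its solvers; the data and alpha are made up), here is the normal-equations/Cholesky route next to a least-squares solve of the augmented system:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, lstsq

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))
y = rng.standard_normal(1000)
alpha = 1.0
n_features = X.shape[1]

# Normal equations + Cholesky: solve (X^T X + alpha * I) w = X^T y.
# Forming X^T X squares the condition number of X.
w_chol = cho_solve(cho_factor(X.T @ X + alpha * np.eye(n_features)), X.T @ y)

# Augmented least squares (solved via QR/SVD inside lstsq), avoiding X^T X:
# min || [X; sqrt(alpha)*I] w - [y; 0] ||^2  ==  min ||Xw - y||^2 + alpha*||w||^2
X_aug = np.vstack([X, np.sqrt(alpha) * np.eye(n_features)])
y_aug = np.concatenate([y, np.zeros(n_features)])
w_qr = lstsq(X_aug, y_aug)[0]

print(np.max(np.abs(w_chol - w_qr)))  # the two solutions agree closely here
```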
Read more comments on GitHub
