question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Linear regression unnecessarily slow

See original GitHub issue

TLDR so far: sklearn uses SVD which is more numerically stable, but perhaps it should default to something quicker such as suggested below unless SVD is needed. Later post contains detailed suggestion.

Description

Standard linear regression implementation seems unnecessarily slow, at least in many cases. May extend to other linear estimators e.g. Ridge.

Steps/Code to Reproduce

from sklearn.linear_model import LinearRegression
import numpy as np
import time

def mylin(X,Y):
    CXX = np.dot(X.T,X)/X.shape[0]
    CXY = np.dot(X.T,Y)/X.shape[0]
    return np.linalg.solve(CXX,CXY).T
    
X = np.random.normal(size=[500000,40])
Y = np.random.normal(size=[500000,10])

lin = LinearRegression(False)

t = time.time()
lin.fit(X,Y)
print('Time to fit sklearn linear:', time.time()-t)

t = time.time()
coef = mylin(X,Y)
print('Time to fit analytic linear regression solution:', time.time()-t)

print('Correlation between solutions:', np.corrcoef(coef.flatten(), lin.coef_.flatten())[0,1])

Results

On my laptop, I see the following results:

Time to fit sklearn linear: 0.5804157257080078
Time to fit analytic linear regression solution: 0.023940324783325195
Correlation between solutions: 1.0

I see a similar speedup on my work machine, which is also a conda install but on Ubuntu 18.04

Versions

Could not locate executable g77 Could not locate executable f77 Could not locate executable ifort Could not locate executable ifl Could not locate executable f90 Could not locate executable efl Could not locate executable gfortran Could not locate executable f95 Could not locate executable g95 Could not locate executable efort Could not locate executable efc Could not locate executable flang don’t know how to compile Fortran code on platform ‘nt’

System: python: 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] executable: C:\Users-–\Anaconda3\pythonw.exe machine: Windows-10-10.0.17763-SP0

BLAS: macros: lib_dirs: cblas_libs: cblas

Python deps: pip: 18.1 setuptools: 40.6.3 sklearn: 0.20.1 numpy: 1.16.3 scipy: 1.1.0 Cython: 0.29.2 pandas: 0.23.4 C:\Users\ShakesBeer\Anaconda3\lib\site-packages\numpy\distutils\system_info.py:638: UserWarning: Atlas (http://math-atlas.sourceforge.net/) libraries not found. Directories to search for the libraries can be specified in the numpy/distutils/site.cfg file (section [atlas]) or by setting the ATLAS environment variable. self.calc_info() C:\Users\ShakesBeer\Anaconda3\lib\site-packages\numpy\distutils\system_info.py:638: UserWarning: Blas (http://www.netlib.org/blas/) libraries not found. Directories to search for the libraries can be specified in the numpy/distutils/site.cfg file (section [blas]) or by setting the BLAS environment variable. self.calc_info() C:\Users\ShakesBeer\Anaconda3\lib\site-packages\numpy\distutils\system_info.py:638: UserWarning: Blas (http://www.netlib.org/blas/) sources not found. Directories to search for the sources can be specified in the numpy/distutils/site.cfg file (section [blas_src]) or by setting the BLAS_SRC environment variable. self.calc_info()

Worth mentioning that I am using MKL, and np.show_config() confirms this.

Issue Analytics

  • State:open
  • Created 4 years ago
  • Comments:22 (13 by maintainers)

github_iconTop GitHub Comments

3reactions
ShakesB33rcommented, May 22, 2019

The scipy/sklearn version will in fact be much more numerically stable and accurate in cases when you have near-singular (or even singular) X covariance matrices, as it uses SVD. In fact, if you try to do an SVD in python via coefs = np.dot(np.linalg.pinv(X), Y), it runs a little slower than sklearn/scipy implementation.

Perhaps the default behaviour could be to calculate the X covariance matrix (CXX), and if that is ill conditioned then to fall back on scipy implementation, otherwise to proceed as above (or something similar). To be more precise, calculate CXX, calculate its condition number cond = np.linalg.cond(CXX), and if that is above some (perhaps user-specified) threshold, e.g. 1e4, then revert back to other solution method. You could also make use of a kwarg that allows you to prefer such a solution method.

New code snippet for comparison below. If you decrease N (to e.g. 20), you will see that the SVD solution stays much closer to the sklearn one, since the conditioning of the sample covariance matrix worsens on average with fewer samples.

from sklearn.linear_model import LinearRegression
import numpy as np
import time

def mylin(X,Y):
    CXX = np.dot(X.T,X)/X.shape[0]
    CXY = np.dot(X.T,Y)/X.shape[0]
    return np.linalg.solve(CXX,CXY).T
    
N = 500000

X = np.random.normal(size=[N,40])
Y = np.random.normal(size=[N,10])

lin = LinearRegression(False)

t = time.time()
lin.fit(X,Y)
print('Time to fit sklearn linear:', time.time()-t)

t = time.time()
coef = mylin(X,Y)
print('Time to fit analytic linear regression solution with matrix inversion:', time.time()-t)

print('Deviation from sklearn solution:', np.std(coef.flatten() - lin.coef_.flatten()))

t = time.time()
coef2 = np.dot(np.linalg.pinv(X), Y).T
print('Time to fit analytic linear regression solution with SVD:', time.time()-t)

print('Deviation from sklearn solution:', np.std(coef2.flatten() - lin.coef_.flatten()))

CXX = np.dot(X.T,X)/X.shape[0]
print('Condition number of X covariance:', np.linalg.cond(CXX))
1reaction
ogriselcommented, Jul 5, 2019

Is there already an open issue to select the fastest solver for Ridge based on the shape of the data (besides the data copy issue)?

One should also open a dedicated issue for adding the cholesky solver to LinearRegression.

Read more comments on GitHub >

github_iconTop Results From Across the Web

5.1: Linear Regression and Correlation - Statistics LibreTexts
Use correlation/linear regression when you have two measurement variables, such as food intake and weight, drug dosage and blood pressure, air ...
Read more >
Efficient way of doing linear regression - Julia Discourse
Hello, I need an efficient way to performs linear regression because I have to fit several segments in a big for loops which...
Read more >
Alteryx Tip #3: Doing linear regressions the easy way!
Today we're going to be looking at one of my favorite statistical tools: the linear regression. Most people don't realize how central this...
Read more >
Robust regression - Wikipedia
In robust statistics, robust regression seeks to overcome some limitations of traditional regression analysis. A regression analysis models the relationship ...
Read more >
Evaluation metrics & Model Selection in Linear Regression
One of the assumptions of a linear regression model is that the errors must be normally distributed. This means, make sure your residuals...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found