
The fit performance of LinearRegression is sub-optimal


It seems that the performance of Linear Regression is sub-optimal when the number of samples is very large.

sklearn_benchmarks measures a speedup of 48 for the optimized implementation from scikit-learn-intelex relative to scikit-learn on a 1,000,000 x 100 dataset. For a given set of parameters and a given dataset, the speed-up is computed as time(scikit-learn) / time(sklearnex). A speed-up of 48 therefore means that sklearnex is 48 times faster than scikit-learn on that dataset.

Profiling allows a more detailed analysis of the execution of the algorithm. We observe that most of the execution time is spent in the lstsq solver of scipy.


The profiling reports of sklearn_benchmarks can be viewed with Perfetto UI.

See benchmark environment information.
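As a rough cross-check outside of sklearn_benchmarks (a minimal sketch rather than the benchmark's own tooling), a plain cProfile run on data of the same shape should show scipy's lstsq wrapper dominating the cumulative time:

# Minimal profiling sketch (assumes the same 1_000_000 x 100 shape as the
# benchmark; absolute numbers depend on the machine and BLAS build).
import cProfile
import pstats

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=1_000_000, n_features=100, n_informative=10)

profiler = cProfile.Profile()
profiler.enable()
LinearRegression().fit(X, y)
profiler.disable()

# Sorting by cumulative time puts scipy.linalg.lstsq near the top of the report.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)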

It seems that the solver could be better chosen when the number of samples is very large. Perhaps Ridge’s solver with a zero penalty could be chosen in this case. On the same dimensions, it shows better performance.

Speedups can be reproduced with the following code. First create and activate a conda environment containing both implementations:

conda create -n lr_perf -c conda-forge scikit-learn scikit-learn-intelex numpy jupyter
conda activate lr_perf

Then run the following Python script:

from sklearn.linear_model import LinearRegression as LinearRegressionSklearn
from sklearnex.linear_model import LinearRegression as LinearRegressionSklearnex
from sklearn.datasets import make_regression
import time
import numpy as np

X, y = make_regression(n_samples=1_000_000, n_features=100, n_informative=10)

def measure(estimator, X, y, n_executions=10):
    times = []
    while len(times) < n_executions:
        t0 = time.perf_counter()
        estimator.fit(X, y)
        t1 = time.perf_counter()
        times.append(t1 - t0)

    return np.mean(times)

mean_time_sklearn = measure(
    estimator=LinearRegressionSklearn(),
    X=X,
    y=y
)

mean_time_sklearnex = measure(
    estimator=LinearRegressionSklearnex(),
    X=X,
    y=y
)

speedup = mean_time_sklearn / mean_time_sklearnex
speedup

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
glemaitre commented, Mar 18, 2022

I would not go so far as to say LinearRegression is terrible.

It is what I found reading your comment afterwards 😃 Hoping that LinearRegression can forgive me for this comment 😃

Regarding using the LAPACK _gesv with np.linalg.solve:

# %%
%%time
coef = np.linalg.solve(X_with_dummy.T @ X_with_dummy, X_with_dummy.T @ y)
CPU times: user 10.2 s, sys: 370 ms, total: 10.5 s
Wall time: 1.47 s

It is then as efficient as LBFGS.
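For anyone reproducing the timing above: the snippet does not show how X_with_dummy was built. A self-contained sketch, assuming it is simply X with an appended column of ones so the intercept is estimated as an extra coefficient, could look like this:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=1_000_000, n_features=100, n_informative=10)

# Append a constant column so the intercept becomes the last coefficient.
X_with_dummy = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve the normal equations (X^T X) w = X^T y via LAPACK's _gesv.
coef = np.linalg.solve(X_with_dummy.T @ X_with_dummy, X_with_dummy.T @ y)
weights, intercept = coef[:-1], coef[-1]

# Sanity check against scikit-learn's own solution (the data is well conditioned,
# so the two approaches should agree closely).
lr = LinearRegression().fit(X, y)
print(np.allclose(weights, lr.coef_, atol=1e-6), np.allclose(intercept, lr.intercept_, atol=1e-6))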

1 reaction
TomDLT commented, Mar 16, 2022

Perhaps Ridge’s solver with a zero penalty could be chosen in this case. On the same dimensions, it shows better performance.

Indeed, one can get a good speedup (x13 on my machine) with Ridge(alpha=0).

Arguably, Ridge should always be preferred instead of LinearRegression anyway.

import time
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.datasets import make_regression

def measure(estimator, X, y, n_executions=10):
    times = []
    while len(times) < n_executions:
        t0 = time.perf_counter()
        estimator.fit(X, y)
        t1 = time.perf_counter()
        times.append(t1 - t0)
    return np.mean(times)

X, y = make_regression(n_samples=1_000_000, n_features=100, n_informative=10)

mean_time_linear_regression = measure(estimator=LinearRegression(), X=X, y=y)
mean_time_ridge = measure(estimator=Ridge(alpha=0), X=X, y=y)
print("speedup =", mean_time_linear_regression / mean_time_ridge)