Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

The fit performance of LinearRegression is sub-optimal

See original GitHub issue

It seems that the performance of Linear Regression is sub-optimal when the number of samples is very large.

sklearn_benchmarks measures a speedup of 48 compared to an optimized implementation from scikit-learn-intelex on a 1000000x100 dataset. For a given set of parameters and a given dataset, we compute the speed-up time scikit-learn / time sklearnex. A speed-up of 48 means that sklearnex is 48 times faster than scikit-learn on the given dataset.

Profiling allows a more detailed analysis of the execution of the algorithm. We observe that most of the execution time is spent in the lstsq solver of scipy.

The profiling reports of sklearn_benchmarks can be viewed with Perfetto UI.

See benchmark environment information

It seems that the solver could be better chosen when the number of samples is very large. Perhaps Ridge’s solver with a zero penalty could be chosen in this case. On the same dimensions, it shows better performance.

Speedups can be reproduced with the following code:

conda create -n lr_perf -c conda-forge scikit-learn scikit-learn-intelex numpy jupyter
conda activate lr_perf

from sklearn.linear_model import LinearRegression as LinearRegressionSklearn
from sklearnex.linear_model import LinearRegression as LinearRegressionSklearnex
from sklearn.datasets import make_regression
import time
import numpy as np

X, y = make_regression(n_samples=1_000_000, n_features=100, n_informative=10)

def measure(estimator, X, y, n_executions=10):
    times = []
    while len(times) < n_executions:
        t0 = time.perf_counter()
        estimator.fit(X, y)
        t1 = time.perf_counter()
        times.append(t1 - t0)

    return np.mean(times)

mean_time_sklearn = measure(
    estimator=LinearRegressionSklearn(),
    X=X,
    y=y
)

mean_time_sklearnex = measure(
    estimator=LinearRegressionSklearnex(),
    X=X,
    y=y
)

speedup = mean_time_sklearn / mean_time_sklearnex
speedup

Issue Analytics

State:
Created 2 years ago
Reactions:2
Comments:9 (9 by maintainers)

Top GitHub Comments

1reaction

glemaitrecommented, Mar 18, 2022

I would not go so far as to say LinearRegression is terrible.

It is what I found reading your comment afterwards 😃 Hoping that LinearRegression can forgive me for this comment 😃

Regarding using the LAPACK _gesv with np.linagl.solve:

# %%
%%time
coef = np.linalg.solve(X_with_dummy.T @ X_with_dummy, X_with_dummy.T @ y)

CPU times: user 10.2 s, sys: 370 ms, total: 10.5 s
Wall time: 1.47 s

It is then as efficient than LBFGS.

1reaction

TomDLTcommented, Mar 16, 2022

Perhaps Ridge’s solver with a zero penalty could be chosen in this case. On the same dimensions, it shows better performance.

Indeed, one can get a good speedup (x13 on my machine) with Ridge(alpha=0).

Arguably, Ridge should always be preferred instead of LinearRegression anyway.

import time
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.datasets import make_regression

def measure(estimator, X, y, n_executions=10):
    times = []
    while len(times) < n_executions:
        t0 = time.perf_counter()
        estimator.fit(X, y)
        t1 = time.perf_counter()
        times.append(t1 - t0)
    return np.mean(times)

X, y = make_regression(n_samples=1_000_000, n_features=100, n_informative=10)

mean_time_linear_regression = measure(estimator=LinearRegression(), X=X, y=y)
mean_time_ridge = measure(estimator=Ridge(alpha=0), X=X, y=y)
print("speedup =", mean_time_linear_regression / mean_time_ridge)