BayesianRidge fails when input and output data are of very different sizes
Example 1 (working):
from sklearn.linear_model import LinearRegression, BayesianRidge
import numpy as np

# Ordinary least squares, with the intercept fitted implicitly
ols = LinearRegression()
ols.fit(np.reshape([1, 2], (-1, 1)), np.array([2, 3]).ravel())

# Bayesian ridge on the same data, with the intercept supplied explicitly
# as a column of ones instead
clf = BayesianRidge(compute_score=True, fit_intercept=False)
clf.fit(np.array([[1, 1], [1, 2]]), np.array([2, 3]).ravel())

print(ols.intercept_, ols.coef_[0])
print(clf.coef_[0], clf.coef_[1])
Expected Results
1, 1
Results for OLS and BayesianRidge:
1.0000000000000004 0.9999999999999998
0.9988917252390923 1.0005536752418909
Example 2 (not working):
from sklearn.linear_model import LinearRegression, BayesianRidge
import numpy as np

# Identical setup to Example 1, but the targets are six orders of
# magnitude larger
ols = LinearRegression()
ols.fit(np.reshape([1, 2], (-1, 1)), np.array([2000000, 3000000]).ravel())

clf = BayesianRidge(compute_score=True, fit_intercept=False)
clf.fit(np.array([[1, 1], [1, 2]]), np.array([2000000, 3000000]).ravel())

print(ols.intercept_, ols.coef_[0])
print(clf.coef_[0], clf.coef_[1])
Expected Results
1000000, 1000000
Results for OLS and BayesianRidge:
1000000.0000000005 999999.9999999997
7.692319353001738e-07 1.2307710964802638e-06
Please notice that the only difference between the two examples is the order of magnitude of the endogenous variable! However, although OLS works well in both cases, in the second case the coefficients returned by the Bayesian regression are essentially 0, 0.
Issue Analytics
- Created: 3 years ago
- Comments: 7 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
It is not a bug; it is caused by a bad assumption about the prior distribution.
BayesianRidge makes two prior assumptions: the weights follow a zero-mean Gaussian with precision lambda, and the observation noise is Gaussian with precision alpha; both precisions are in turn given Gamma hyperpriors.
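For reference, this is the probabilistic model documented for scikit-learn's BayesianRidge, written out in LaTeX (the symbols correspond to the estimator's hyperparameter names):

\begin{aligned}
p(y \mid X, w, \alpha) &= \mathcal{N}(y \mid Xw,\ \alpha^{-1} I) \\
p(w \mid \lambda) &= \mathcal{N}(w \mid 0,\ \lambda^{-1} I) \\
\alpha &\sim \mathrm{Gamma}(\alpha_1, \alpha_2), \qquad \lambda \sim \mathrm{Gamma}(\lambda_1, \lambda_2)
\end{aligned}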
Both Gamma distributions have two parameters each (alpha_1, alpha_2 and lambda_1, lambda_2). These parameters are set before training, and each defaults to 1e-6.
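All four are ordinary constructor arguments. A minimal sketch spelling out the documented defaults explicitly (this is equivalent to BayesianRidge() with no arguments):

from sklearn.linear_model import BayesianRidge

# The four Gamma hyperparameters all default to 1e-6
clf = BayesianRidge(alpha_1=1e-6, alpha_2=1e-6,
                    lambda_1=1e-6, lambda_2=1e-6)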
These defaults mean that the weights and the noise are expected to be small. In other words, before training, the prior probability of small weights is high and the probability of large weights is low. But your data requires large weights.
Of course, the model can learn from the data and fit a posterior distribution, but the posterior is still pulled toward the prior (which prefers low weights), so the model cannot estimate good weights.
To fix it, you can bring the data and the prior onto compatible scales, for example as sketched below.
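One way to do this (a minimal sketch, not necessarily the commenter's original code; the unit-scale choice here is an illustrative assumption) is to rescale the target before fitting and undo the scaling afterwards:

from sklearn.linear_model import BayesianRidge
import numpy as np

X = np.array([[1, 1], [1, 2]])
y = np.array([2000000, 3000000], dtype=float)

# Bring y to unit scale so the default small-weight prior is reasonable
scale = y.std()
clf = BayesianRidge(compute_score=True, fit_intercept=False)
clf.fit(X, y / scale)

# Map the learned coefficients back to the original units
coef = clf.coef_ * scale
print(coef)  # should now be close to the OLS solution, about [1e6, 1e6]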
An alternative way is to use fit_intercept=True.
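A minimal sketch of this alternative on the Example 2 data (the exact numbers printed will depend on the scikit-learn version):

from sklearn.linear_model import BayesianRidge
import numpy as np

X = np.array([[1, 1], [1, 2]])
y = np.array([2000000, 3000000])

# Let the estimator center the data and fit the offset itself, rather than
# modeling the intercept as a regularized weight on a column of ones
clf = BayesianRidge(compute_score=True, fit_intercept=True)
clf.fit(X, y)
print(clf.intercept_, clf.coef_)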
@SB6988 I think the best we can offer is to improve the docstring of the class. Feel free to open a PR so the next person is less likely to hit the same difficulty as you.