GaussianProcessRegressor (predict)
Discussed in https://github.com/scikit-learn/scikit-learn/discussions/22925

Originally posted by jecampagne, March 23, 2022
Hello,

I am questioning the code of `predict` of the `GaussianProcessRegressor`. The code is based on Alg. 2.1 of C. E. Rasmussen & C. K. I. Williams (2003). I have the 2006 version and I do not know if there has been a modification between the two versions. Well, the algorithm is based on the following formulas (R&W eqs. 2.23 and 2.24):

$$\bar{\mathbf{f}}_* = K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1}\,\mathbf{y}$$

$$\operatorname{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1}\,K(X, X_*)$$
Notice that K(X, X) (i.e., X = X_train) is the only term that contains the "noise" parameter, while K(X*, X) and K(X*, X*) (i.e., X* = X_test) do not get this additional part on the diagonal. And this is OK.
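As a quick check of where scikit-learn's `WhiteKernel` actually contributes, here is a minimal sketch (not from the original post): it adds `noise_level` to the diagonal only when the kernel is evaluated as `k(X)`, and returns zeros when evaluated as `k(X, Y)`, even for `Y = X`:

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel as C

kernel = C(1.0) * RBF(1.0) + WhiteKernel(noise_level=0.5)
X = np.array([[0.0], [1.0], [2.0]])

K_auto = kernel(X)       # WhiteKernel adds noise_level * I here
K_cross = kernel(X, X)   # WhiteKernel returns zeros when Y is given explicitly

print(np.diag(K_auto - K_cross))  # -> [0.5, 0.5, 0.5]
```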
Looking now at the code, it is OK for the default kernel (i.e., RBF with fixed scale and length):

```python
self.kernel_ = C(1.0, constant_value_bounds="fixed") * RBF(
    1.0, length_scale_bounds="fixed"
)
```
But if the user passes a kernel composed with the `WhiteKernel`, such as

```python
kernel = C(1.0) * RBF() + WhiteKernel(0.5)
```

then it seems that K(X*, X) and K(X*, X*) will use the `WhiteKernel` part, which is not what Alg. 2.1 of Rasmussen & Williams does.
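To make the concern concrete end to end, a hedged sketch with illustrative data (the attribute path `kernel_.k2` assumes the `Sum` structure of the kernel above, with the `WhiteKernel` as the second summand):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel as C

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 5, (20, 1))
y_train = np.sin(X_train).ravel() + 0.1 * rng.standard_normal(20)
X_test = np.linspace(0, 5, 7).reshape(-1, 1)

kernel = C(1.0) * RBF() + WhiteKernel(0.5)
gpr = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_train)

_, y_std = gpr.predict(X_test, return_std=True)
noise = gpr.kernel_.k2.noise_level  # the fitted WhiteKernel term (k2 of the Sum)

# predict() evaluates kernel_(X_test) internally, so the predictive variance
# is inflated by the fitted noise_level at every test point:
print(y_std**2 - noise)  # what cov(f*) from eq. 2.24 would put on the diagonal
```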
Top GitHub Comments
I set up a similar comparison between the R&W textbook equations (eqns. 2.23 and 2.24, which lead to Alg. 2.1), GPy, tinygp, gpytorch and sklearn. You can find it in ~~this gist~~ this repo. The sklearn bits are at the very end. I think what's happening is that:

- sklearn `predict` = GPy `predict_noiseless` = Alg. 2.1 when `WhiteKernel` is not used. Then `y_cov` = cov(f*) from eq. 2.24, because you then imply `noise_level` = 0. You only have `GaussianProcessRegressor(alpha=...)` (default is `1e-10`, by the way), which is added to the diagonal during `fit()` in the same way as `noise_level` would be. But because it is "small" it is usually not considered noise but a regularization (or "jitter") parameter, which is odd because it has the exact same effect on the fit weights, and thus on `y_mean`, as a noise parameter would. This is true for most GP implementations. There seems to be a magic threshold above which people start calling it noise.
- sklearn `predict` = GPy `predict` != Alg. 2.1 when `WhiteKernel` is used. Then `y_cov = cov(f*) + eye(...) * noise_level`. I think that's what R&W 2006, page 18, the part on "noisy predictions", refers to, which is about the only resource on this I'm aware of. Most other GP resources I've looked at basically discuss eqns. 2.23 and 2.24 only and then call it a day, which doesn't exactly help to clarify things.
- If `kernel(X, X)` is called in `predict()` instead of `kernel(X)`, then it behaves as GPy `predict_noiseless()`.

The central question regarding this issue is whether this behavior is intended, given that the other three tested packages expose the distinction between the behaviors equivalent to `predict` vs. `predict_noiseless` to the user.
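To make the comparison reproducible here, a minimal sketch along the same lines (my own, not the linked repo): implement eqs. 2.23-2.24 / Alg. 2.1 directly with a Cholesky factorization and compare against `predict`. With a `WhiteKernel` in the kernel, the means agree and the covariances differ by `noise_level` on the diagonal:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel as C

rng = np.random.default_rng(42)
X = rng.uniform(0, 5, (30, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(30)
Xs = np.linspace(0, 5, 5).reshape(-1, 1)

gpr = GaussianProcessRegressor(C(1.0) * RBF(1.0) + WhiteKernel(0.1)).fit(X, y)
k = gpr.kernel_  # the fitted kernel, WhiteKernel term included

# Alg. 2.1 / eqs. 2.23-2.24: the noise enters K(X, X) only.
K = k(X) + gpr.alpha * np.eye(len(X))  # k(X) already has noise_level on the diag
Ks = k(Xs, X)                          # WhiteKernel contributes zeros here
Kss = k(Xs, Xs)                        # ...and here: the noise-free K(X*, X*)
L = cho_factor(K, lower=True)
weights = cho_solve(L, y)
f_mean = Ks @ weights                      # eq. 2.23
f_cov = Kss - Ks @ cho_solve(L, Ks.T)      # eq. 2.24, i.e. cov(f*)

y_mean, y_cov = gpr.predict(Xs, return_cov=True)
print(np.allclose(f_mean, y_mean))         # True: the means agree
print(np.diag(y_cov - f_cov))              # ~ fitted noise_level everywhere
```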
However, `predict()` also calculates `y_cov`. The code does `y_cov = self.kernel_(X) - V.T @ V`. If the above explanation is correct, then one might need to use `self.kernel_(X, X)` instead, which would be K(X*, X*) in R&W 2006 (eqs. 2.24 and 2.26, Alg. 2.1 line 6). The same probably applies to `y_std`.
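A possible user-side workaround, sketched under the assumption that the fitted kernel is a `Sum` whose second term is the `WhiteKernel` (so the fitted noise level sits at `kernel_.k2.noise_level`); it recovers the noise-free cov(f*) of eq. 2.24 from the public API:

```python
import numpy as np

def predict_noiseless(gpr, X):
    """Noise-free posterior (cov(f*) of R&W eq. 2.24) from a fitted
    GaussianProcessRegressor whose kernel is <signal kernel> + WhiteKernel.
    """
    y_mean, y_cov = gpr.predict(X, return_cov=True)
    noise = gpr.kernel_.k2.noise_level  # assumes WhiteKernel is the second summand
    return y_mean, y_cov - noise * np.eye(len(X))
```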