Fix gradient boosting quantile regression
Describe the workflow you want to enable
The quantile loss function used by the Gradient Boosting Regressor is too conservative in its predictions for extreme values.
This makes the quantile regression almost equivalent to looking up the dataset's unconditional quantile, which is not really useful.
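For reference, the quantile loss in question is the pinball loss; a minimal NumPy sketch of it (an illustration, not sklearn's internal implementation) looks like this:
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    # Penalise under-predictions by alpha and over-predictions by (1 - alpha),
    # so the minimiser is the alpha-quantile of y_true given the features.
    diff = y_true - y_pred
    return np.mean(np.where(diff >= 0, alpha * diff, (alpha - 1) * diff))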
Describe your proposed solution
Use the same type of loss function as in the scikit-garden package.
Describe alternatives you’ve considered, if relevant
When the GB regressor is overfitting, this behavior seems to go away.
Additional context
import pandas as pd
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor
from skgarden import RandomForestQuantileRegressor
data = load_boston()
X = pd.DataFrame(data=data["data"], columns=data["feature_names"])
y = pd.Series(data=data["target"])
# with sklearn: refit the same estimator for each quantile via set_params(alpha=...)
gb_learn = GradientBoostingRegressor(loss="quantile", n_estimators=20, max_depth=10)
gb_learn.set_params(alpha=0.5)
gb_learn.fit(X, y)
pred_learn_median = gb_learn.predict(X)
gb_learn.set_params(alpha=0.05)
gb_learn.fit(X, y)
pred_learn_m_ci = gb_learn.predict(X)
gb_learn.set_params(alpha=0.95)
gb_learn.fit(X, y)
pred_learn_p_ci = gb_learn.predict(X)
fig = plt.figure(figsize=(12, 8))
sns.scatterplot(x=y, y=pred_learn_median, label="Median")
sns.scatterplot(x=y, y=pred_learn_m_ci, label="5% quantile")
sns.scatterplot(x=y, y=pred_learn_p_ci, label="95% quantile")
plt.plot([0, 50], [0, 50], c="red")
sns.despine()
plt.xlabel("True value")
plt.ylabel("Predicted value")
plt.show()
# with skgarden: fit once, query any quantile at predict time
rf_garden = RandomForestQuantileRegressor(n_estimators=20, max_depth=3)
rf_garden.fit(X, y)
pred_garden_median = rf_garden.predict(X, quantile=50)
pred_garden_m_ci = rf_garden.predict(X, quantile=5)
pred_garden_p_ci = rf_garden.predict(X, quantile=95)
fig = plt.figure(figsize=(12, 8))
sns.scatterplot(x=y, y=pred_garden_median, label="Median")
sns.scatterplot(x=y, y=pred_garden_m_ci, label="5% quantile")
sns.scatterplot(x=y, y=pred_garden_p_ci, label="95% quantile")
plt.plot([0, 50], [0, 50], c="red")
sns.despine()
plt.xlabel("True value")
plt.ylabel("Predicted value")
plt.show()
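A rough sanity check on the two sets of predictions above (reusing the variables from this snippet): the fraction of training targets falling below each predicted quantile should be close to 0.05 and 0.95. Coverage measured on training data is optimistic for the more overfitting model, so this is only indicative.
print("sklearn  5%:", (y < pred_learn_m_ci).mean(), "95%:", (y < pred_learn_p_ci).mean())
print("skgarden 5%:", (y < pred_garden_m_ci).mean(), "95%:", (y < pred_garden_p_ci).mean())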
Issue Analytics
- Created 3 years ago
- Reactions: 1
- Comments: 22 (9 by maintainers)
Hey @lorentzenchr,
Thanks for your feedback. While I agree that the RF and GB models are not 100% comparable, the GB overfits more (its median predictions are closer to the y = x line), so it should also be doing better for the quantiles, which is not the case.
I still think the skgarden approach has 2 benefits:
because the splits are made to minimise MSE, you can reuse the same model (without retraining) for all quantiles: the quantile for a particular prediction is estimated from the distribution of the training samples that share its leaves (see the sketch after these two points). I am not sure this approach would work with a gradient boosting model though, because the distributions in the leaves are not independent.
objectively, it seems pretty clear that the skgarden model works much better than the sklearn model here. For example, consider a point whose true value (y_true) is 15. The sklearn model predicts a median around 15.0, which is great, but then predicts a 5% quantile around 13.0, which seems too close, and a 95% quantile around 30.0, which seems far too wide. The skgarden model makes much more sensible predictions in this case (5%: ~10.0, 50%: ~15.0, 95%: ~20.0). Moreover, it seems extremely dodgy that when the sklearn model's estimated median is 30.0 its 95% quantile is 30.0, yet when its estimated median is 10.0 its 95% quantile is still 30.0.
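A minimal sketch of that first point, using a plain sklearn RandomForestRegressor and the X, y from the snippet in the issue (this only illustrates the idea, not skgarden's exact weighting scheme):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fit once with ordinary MSE splits
rf = RandomForestRegressor(n_estimators=20, max_depth=3).fit(X, y)
train_leaves = rf.apply(X)            # (n_samples, n_trees) leaf ids of the training data
query_leaves = rf.apply(X.iloc[:1])   # leaf ids of one query point

# Collect the training targets that share a leaf with the query in each tree,
# then read any quantile off that empirical distribution
neighbours = []
for tree_idx in range(train_leaves.shape[1]):
    mask = train_leaves[:, tree_idx] == query_leaves[0, tree_idx]
    neighbours.append(y[mask])
neighbours = np.concatenate(neighbours)
print(np.percentile(neighbours, [5, 50, 95]))  # same fitted model, any quantile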
Ignoring the debate about which loss makes more sense from a theoretical standpoint, this second point makes the sklearn quantile regression model unusable in any practical application.
I think this issue should not be dismissed so quickly and without any debate.
Really cool. It does make a lot of sense that predicting two percentiles with different levels of noise would require different amounts of over/under-fitting.
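A hedged sketch of what that could look like with the current estimator (the hyperparameter values are purely illustrative and would need tuning per quantile; X and y as in the issue):
from sklearn.ensemble import GradientBoostingRegressor

# A separate model, with a separate amount of regularisation, for each quantile
models = {
    0.05: GradientBoostingRegressor(loss="quantile", alpha=0.05, max_depth=2, n_estimators=200),
    0.50: GradientBoostingRegressor(loss="quantile", alpha=0.50, max_depth=4, n_estimators=200),
    0.95: GradientBoostingRegressor(loss="quantile", alpha=0.95, max_depth=2, n_estimators=200),
}
preds = {q: m.fit(X, y).predict(X) for q, m in models.items()}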