Fix gradient boosting quantile regression
Describe the workflow you want to enable
The quantile loss function used by the Gradient Boosting Regressor is too conservative in its predictions for extreme values.
This makes the quantile regression almost equivalent to looking up the dataset's unconditional quantile, which is not really useful.
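For reference, the quantile loss in question is the pinball loss; a minimal NumPy sketch of it (an illustration, not sklearn's internal implementation) looks like this:
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    # Penalise under-predictions by alpha and over-predictions by (1 - alpha),
    # so the minimiser is the alpha-quantile of y_true given the features.
    diff = y_true - y_pred
    return np.mean(np.where(diff >= 0, alpha * diff, (alpha - 1) * diff))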
Describe your proposed solution
Use the same type of loss function as in the scikit-garden package.
Describe alternatives you’ve considered, if relevant
When the GB regressor is overfitting, this behavior seems to go away.
Additional context
import pandas as pd
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor
from skgarden import RandomForestQuantileRegressor
data = load_boston()
X = pd.DataFrame(data=data["data"], columns=data["feature_names"])
y = pd.Series(data=data["target"])
# with sklearn: refit the same estimator for each quantile via set_params(alpha=...)
gb_learn = GradientBoostingRegressor(loss="quantile", n_estimators=20, max_depth=10)
gb_learn.set_params(alpha=0.5)
gb_learn.fit(X, y)
pred_learn_median = gb_learn.predict(X)
gb_learn.set_params(alpha=0.05)
gb_learn.fit(X, y)
pred_learn_m_ci = gb_learn.predict(X)
gb_learn.set_params(alpha=0.95)
gb_learn.fit(X, y)
pred_learn_p_ci = gb_learn.predict(X)
fig = plt.figure(figsize=(12, 8))
sns.scatterplot(x=y, y=pred_learn_median, label="Median")
sns.scatterplot(x=y, y=pred_learn_m_ci, label="5% quantile")
sns.scatterplot(x=y, y=pred_learn_p_ci, label="95% quantile")
plt.plot([0, 50], [0, 50], c="red")
sns.despine()
plt.xlabel("True value")
plt.ylabel("Predicted value")
plt.show()
# with skgarden: fit once, query any quantile at predict time
rf_garden = RandomForestQuantileRegressor(n_estimators=20, max_depth=3)
rf_garden.fit(X, y)
pred_garden_median = rf_garden.predict(X, quantile=50)
pred_garden_m_ci = rf_garden.predict(X, quantile=5)
pred_garden_p_ci = rf_garden.predict(X, quantile=95)
fig = plt.figure(figsize=(12, 8))
sns.scatterplot(x=y, y=pred_garden_median, label="Median")
sns.scatterplot(x=y, y=pred_garden_m_ci, label="5% quantile")
sns.scatterplot(x=y, y=pred_garden_p_ci, label="95% quantile")
plt.plot([0, 50], [0, 50], c="red")
sns.despine()
plt.xlabel("True value")
plt.ylabel("Predicted value")
plt.show()
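A rough sanity check on the two sets of predictions above (reusing the variables from this snippet): the fraction of training targets falling below each predicted quantile should be close to 0.05 and 0.95. Coverage measured on training data is optimistic for the more overfitting model, so this is only indicative.
print("sklearn  5%:", (y < pred_learn_m_ci).mean(), "95%:", (y < pred_learn_p_ci).mean())
print("skgarden 5%:", (y < pred_garden_m_ci).mean(), "95%:", (y < pred_garden_p_ci).mean())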
Issue Analytics
- Created 3 years ago
- Reactions: 1
- Comments: 22 (9 by maintainers)
Hey @lorentzenchr,
Thanks for your feedback. While I agree that the RF and GB models are not 100% comparable, the GB overfits more (its median predictions are closer to the y = x line), so it should also be doing better for the quantiles, which is not the case.
I still think the skgarden approach has 2 benefits:
because the splits are made to minimise MSE, you can reuse the same model (without retraining) for all quantiles: the quantile for a particular prediction is estimated from the distribution of the training samples that share its leaves (see the sketch after these two points). I am not sure this approach would work with a gradient boosting model though, because the distributions in the leaves are not independent.
objectively, it seems pretty clear that the skgarden model works much better than the sklearn model here. For example, consider a point whose true value (y_true) is 15. The sklearn model predicts a median around 15.0, which is great, but then predicts a 5% quantile around 13.0, which seems too close, and a 95% quantile around 30.0, which seems far too wide. The skgarden model makes much more sensible predictions in this case (5%: ~10.0, 50%: ~15.0, 95%: ~20.0). Moreover, it seems extremely dodgy that when the sklearn model's estimated median is 30.0 its 95% quantile is 30.0, yet when its estimated median is 10.0 its 95% quantile is still 30.0.
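A minimal sketch of that first point, using a plain sklearn RandomForestRegressor and the X, y from the snippet in the issue (this only illustrates the idea, not skgarden's exact weighting scheme):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fit once with ordinary MSE splits
rf = RandomForestRegressor(n_estimators=20, max_depth=3).fit(X, y)
train_leaves = rf.apply(X)            # (n_samples, n_trees) leaf ids of the training data
query_leaves = rf.apply(X.iloc[:1])   # leaf ids of one query point

# Collect the training targets that share a leaf with the query in each tree,
# then read any quantile off that empirical distribution
neighbours = []
for tree_idx in range(train_leaves.shape[1]):
    mask = train_leaves[:, tree_idx] == query_leaves[0, tree_idx]
    neighbours.append(y[mask])
neighbours = np.concatenate(neighbours)
print(np.percentile(neighbours, [5, 50, 95]))  # same fitted model, any quantile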
Ignoring the debate about which loss makes more sense from a theoretical standpoint, this second point makes the sklearn quantile regression model unusable in any practical application.
I think this issue should not be dismissed so quickly and without any debate.
Really cool. It does make a lot of sense that predicting two percentiles with different levels of noise would require different amounts of over/under-fitting.
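A hedged sketch of what that could look like with the current estimator (the hyperparameter values are purely illustrative and would need tuning per quantile; X and y as in the issue):
from sklearn.ensemble import GradientBoostingRegressor

# A separate model, with a separate amount of regularisation, for each quantile
models = {
    0.05: GradientBoostingRegressor(loss="quantile", alpha=0.05, max_depth=2, n_estimators=200),
    0.50: GradientBoostingRegressor(loss="quantile", alpha=0.50, max_depth=4, n_estimators=200),
    0.95: GradientBoostingRegressor(loss="quantile", alpha=0.95, max_depth=2, n_estimators=200),
}
preds = {q: m.fit(X, y).predict(X) for q, m in models.items()}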