
Fix gradient boosting quantile regression


Describe the workflow you want to enable

The quantile loss function used by the GradientBoostingRegressor is too conservative in its predictions for extreme quantiles.

This makes the quantile regression almost equivalent to looking up the dataset’s unconditional quantile, which is not really useful.
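
For reference, a minimal sketch of the pinball (quantile) loss that loss="quantile" minimizes for a target quantile level alpha; the helper name pinball_loss is just for illustration:

import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    # alpha * residual when under-predicting,
    # (1 - alpha) * |residual| when over-predicting
    diff = y_true - y_pred
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))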

Describe your proposed solution

Use the same type of quantile-estimation approach as in the scikit-garden package (which derives quantiles from the distribution of training samples in each leaf, rather than from a dedicated quantile loss).

Describe alternatives you’ve considered, if relevant

When the gradient boosting regressor overfits, this behavior seems to go away.
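
A quick, hedged way to see this (parameter values are arbitrary; uses X and y from the "Additional context" snippet below): refit the 95% quantile model with much more capacity and compare the spread of its predictions.

from sklearn.ensemble import GradientBoostingRegressor

gb_overfit = GradientBoostingRegressor(
    loss="quantile", alpha=0.95, n_estimators=500, max_depth=10
)
gb_overfit.fit(X, y)
# with enough capacity the 95% quantile predictions start tracking the
# individual targets instead of sitting near the unconditional quantile
print(gb_overfit.predict(X)[:5])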

Additional context

import pandas as pd
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor
from skgarden import RandomForestQuantileRegressor

data = load_boston()  # note: load_boston was removed in scikit-learn 1.2
X = pd.DataFrame(data=data["data"], columns=data["feature_names"])
y = pd.Series(data=data["target"])

# with sklearn: loss="quantile" requires fitting one model per quantile (alpha)
gb_learn = GradientBoostingRegressor(loss="quantile", n_estimators=20, max_depth=10)

gb_learn.set_params(alpha=0.5)
gb_learn.fit(X, y)
pred_learn_median = gb_learn.predict(X)
gb_learn.set_params(alpha=0.05)
gb_learn.fit(X, y)
pred_learn_m_ci = gb_learn.predict(X)
gb_learn.set_params(alpha=0.95)
gb_learn.fit(X, y)
pred_learn_p_ci = gb_learn.predict(X)

fig = plt.figure(figsize=(12, 8))
sns.scatterplot(x=y, y=pred_learn_median, label="Median")
sns.scatterplot(x=y, y=pred_learn_m_ci, label="5% quantile")
sns.scatterplot(x=y, y=pred_learn_p_ci, label="95% quantile")
plt.plot([0, 50], [0, 50], c="red")
sns.despine()
plt.xlabel("True value")
plt.ylabel("Predicted value")
plt.show()

# with skgarden: a single forest, fit once, serves all quantiles
rf_garden = RandomForestQuantileRegressor(n_estimators=20, max_depth=3)
rf_garden.fit(X, y)
pred_garden_median = rf_garden.predict(X, quantile=50)
pred_garden_m_ci = rf_garden.predict(X, quantile=5)
pred_garden_p_ci = rf_garden.predict(X, quantile=95)

fig = plt.figure(figsize=(12, 8))
sns.scatterplot(x=y, y=pred_garden_median, label="Median")
sns.scatterplot(x=y, y=pred_garden_m_ci, label="5% quantile")
sns.scatterplot(x=y, y=pred_garden_p_ci, label="95% quantile")
plt.plot([0, 50], [0, 50], c="red")
sns.despine()
plt.xlabel("True value")
plt.ylabel("Predicted value")
plt.show()

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 22 (9 by maintainers)

Top GitHub Comments

1 reaction
Bougeant commented on Nov 18, 2020

Hey @lorentzenchr,

Thanks for your feedback. While I agree that the RF and GB are not 100% comparable, the GB overfits more (see the median case, which is closer to y=x), so it should be doing better on the quantiles as well, which is not the case.

I still think the skgarden approach has 2 benefits:

  1. Because the splits are made to minimise MSE, you can reuse the same model (without retraining) for all quantiles. It relies on the distribution of the training samples sharing the same leaves to estimate the quantiles for a particular prediction (see the sketch at the end of this comment). I am not sure this approach would work with a gradient boosting model, though, because the distributions in each leaf are not independent.

  2. Objectively, it seems pretty clear that the skgarden model works much better than the sklearn model. For example, consider a point where the true value (y_true) is 15. The sklearn model predicts that the median is around 15.0, which is great, but then predicts that the 5% quantile is around 13.0, which seems a bit too close, and that the 95% quantile is around 30.0, which seems way too far. The skgarden model makes much more sensible predictions in this case (5%: ~10.0, 50%: ~15.0, 95%: ~20.0). Moreover, it seems extremely dodgy that when the sklearn model’s estimated median is 30.0, its 95% quantile is 30.0, but when its estimated median is 10.0, its 95% quantile is still 30.0.

Ignoring the debate about which loss makes more sense from a theoretical standpoint, this second point makes the sklearn quantile regression model unusable in any practical application.

I think this issue should not be dismissed so quickly and without any debate.
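
For illustration, a minimal sketch of the leaf-distribution idea from point 1, using only scikit-learn primitives. qrf_predict is a hypothetical helper, and this is a simplified, unweighted variant of Meinshausen’s (2006) quantile regression forests, not the exact skgarden implementation; X and y are from the snippet above.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def qrf_predict(forest, X_train, y_train, X_test, quantile):
    # map every training and test sample to its leaf in each tree
    train_leaves = forest.apply(X_train)  # shape (n_train, n_trees)
    test_leaves = forest.apply(X_test)    # shape (n_test, n_trees)
    y_train = np.asarray(y_train)
    preds = np.empty(len(test_leaves))
    for i, leaves in enumerate(test_leaves):
        # pool, over all trees, the training targets that land in the
        # same leaf as this test point, then read off the quantile
        pooled = np.concatenate([
            y_train[train_leaves[:, t] == leaves[t]]
            for t in range(train_leaves.shape[1])
        ])
        preds[i] = np.percentile(pooled, quantile)
    return preds

rf = RandomForestRegressor(n_estimators=20, max_depth=3).fit(X, y)
pred_p_ci = qrf_predict(rf, X, y, X, quantile=95)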

0 reactions
Bougeant commented on Feb 16, 2021

Really cool. It does make a lot of sense that predicting two percentiles with different levels of noise would require different amounts of over-/under-fitting.
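
A hedged sketch of what that could look like in practice: tuning hyperparameters separately for each quantile, scoring every candidate with the pinball loss at its own alpha (assumes scikit-learn >= 0.24 for mean_pinball_loss; the grid values are arbitrary, and X and y are from the snippet above).

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer, mean_pinball_loss
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [2, 5, 10], "min_samples_leaf": [1, 10, 50]}
models = {}
for alpha in (0.05, 0.5, 0.95):
    # lower pinball loss is better, hence greater_is_better=False
    scorer = make_scorer(mean_pinball_loss, alpha=alpha, greater_is_better=False)
    search = GridSearchCV(
        GradientBoostingRegressor(loss="quantile", alpha=alpha, n_estimators=20),
        param_grid, scoring=scorer, cv=5,
    )
    models[alpha] = search.fit(X, y).best_estimator_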


