Data leak in GBDT due to warm start
See original GitHub issue. (This is about the non-histogram-based version of GBDTs.)
X is split into train and validation data with train_test_split(random_state=self.random_state).
As @johannfaouzi noted, in a warm-starting context this will produce a leak if self.random_state is a RandomState instance: some samples that were used for training in a previous fit might be used for validation now.
I think the right fix would be to raise a ValueError if the provided random state isn’t a number and early stopping is activated.
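To see why a RandomState instance causes the leak, here is a minimal sketch (illustrative, not scikit-learn's internal code): a RandomState instance mutates on each call, so successive calls to train_test_split produce different splits, whereas an integer seed reproduces the same split every time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)

# Case 1: random_state is a RandomState instance. Its internal state advances
# on every call, so the validation set differs between the two "fits".
rng = np.random.RandomState(42)
_, val_a = train_test_split(X, test_size=0.25, random_state=rng)
_, val_b = train_test_split(X, test_size=0.25, random_state=rng)
print(np.array_equal(val_a, val_b))  # False: validation samples changed

# Case 2: random_state is an int. Every call reproduces the same split,
# so warm-started fits keep the same validation set.
_, val_c = train_test_split(X, test_size=0.25, random_state=0)
_, val_d = train_test_split(X, test_size=0.25, random_state=0)
print(np.array_equal(val_c, val_d))  # True
```

In the warm-start setting, case 1 means samples trained on during the first fit can end up in the validation set of the second fit, which is exactly the leak described above.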
Issue Analytics
- State:
- Created 4 years ago
- Comments: 11 (11 by maintainers)

+1 for storing a seed as a fit param and reusing it to seed an RNG in fit, but only when warm_start=True. AFAIK, np.random.RandomState accepts uint32 seeds only (between 0 and 2**32 - 1). So the correct way to get a seed from an existing random state object is:

Hmm, this issue was about the non-histogram-based version of GBDTs, so #14999 did not do anything about this. I intentionally did not use keywords so that this issue would not be closed.
However, I have just started working on a fix for this issue 😃
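The code snippet that originally followed "the correct way to get a seed from an existing random state object is:" was not captured in this snapshot. A minimal sketch of the idea, assuming NumPy and scikit-learn (the helper name split_with_stored_seed is illustrative, not part of any library):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_with_stored_seed(X, rng, stored_seed=None):
    """Illustrative helper: on the first fit, draw a uint32-compatible seed
    from the user's RandomState and remember it; on warm-started fits, reuse
    the stored seed so the train/validation split is identical."""
    if stored_seed is None:
        # RandomState accepts uint32 seeds only, so draw a value in
        # [0, 2**32 - 1] from the existing generator.
        stored_seed = rng.randint(np.iinfo(np.uint32).max, dtype="uint64")
    X_train, X_val = train_test_split(
        X, test_size=0.25, random_state=int(stored_seed)
    )
    return X_train, X_val, stored_seed

X = np.arange(40).reshape(-1, 1)
rng = np.random.RandomState(0)

_, val_first, seed = split_with_stored_seed(X, rng)                # first fit
_, val_warm, _ = split_with_stored_seed(X, rng, stored_seed=seed)  # warm start

print(np.array_equal(val_first, val_warm))  # True: same validation samples
```

Because the drawn seed is stored and reused, warm-started fits reconstruct exactly the same validation set, while users passing different RandomState instances still get different splits across estimators.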