
data leak in GBDT due to warm start

See original GitHub issue

(This is about the non-histogram-based version of GBDTs)

X is split into train and validation data with train_test_split(random_state=self.random_state).

As @johannfaouzi noted, in a warm-starting context this will produce a leak if self.random_state is a RandomState instance: some samples that were used for training in a previous fit might be used for validation now.

I think the right fix would be to raise a ValueError if the provided random state isn't a number and early stopping is activated.
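The leak described above can be reproduced directly with train_test_split. A minimal sketch: with an integer seed every call yields the same split, but a RandomState instance is consumed in place, so a warm-started refit draws a different validation set and may validate on samples it already trained on.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)

# With an integer seed, every call reproduces the same split.
_, val_a = train_test_split(X, random_state=0)
_, val_b = train_test_split(X, random_state=0)
assert np.array_equal(val_a, val_b)

# With a RandomState instance, each call advances the stream, so a
# second fit (warm start) gets a different validation set: samples
# trained on in the first fit can land in the new validation fold.
rng = np.random.RandomState(0)
_, val_1 = train_test_split(X, random_state=rng)
_, val_2 = train_test_split(X, random_state=rng)
print(np.array_equal(val_1, val_2))  # False (almost surely)
```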

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
ogrisel commented, Sep 13, 2019

+1 for storing a seed as a fit attribute and reusing that to seed an rng in fit only when warm_start=True.

AFAIK, np.random.RandomState accepts uint32 seeds only (between 0 and 2**32 - 1). So the correct way to get a seed from an existing random state object is:

self.random_seed_ = check_random_state(self.random_state).randint(np.iinfo(np.uint32).max)

0 reactions
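Wrapped in an estimator, the pattern @ogrisel suggests looks roughly like the sketch below. WarmStartSplitter is a hypothetical toy class, not scikit-learn code: it freezes a uint32 seed on the first fit and reuses it on warm-started refits, so the train/validation split stays stable even when random_state is a RandomState instance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import check_random_state

class WarmStartSplitter:
    """Hypothetical sketch: keep the train/validation split stable
    across warm-started fits by freezing a seed on the first fit."""

    def __init__(self, random_state=None, warm_start=False):
        self.random_state = random_state
        self.warm_start = warm_start

    def fit(self, X):
        # Draw a uint32 seed once; reuse it on warm-started refits so
        # the same samples end up in the validation set every time.
        if not (self.warm_start and hasattr(self, "random_seed_")):
            rng = check_random_state(self.random_state)
            self.random_seed_ = rng.randint(np.iinfo(np.uint32).max)
        self.X_train_, self.X_val_ = train_test_split(
            X, random_state=self.random_seed_)
        return self
```

Fitting twice with warm_start=True and a RandomState instance now yields identical validation sets, closing the leak sketched above.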
johannfaouzi commented, Sep 19, 2019

Hmm this issue was about the non-histogram-based version of GBDTs, so #14999 did not do anything about this. I intentionally did not use keywords so that this issue would not be closed.

However, I’ve just started working on a fix for this issue 😃

Read more comments on GitHub >

Top Results From Across the Web

Privacy-Preserving Gradient Boosting Decision Trees - arXiv
Existing solutions for GBDT with differential privacy suffer from significant accuracy loss due to too-loose sensitivity bounds and ineffective ...

Implementing Gradient Boosting Regression in Python
In this article we'll start with an introduction to gradient boosting for regression problems, what makes it so advantageous, and its different parameters...

Introduction to gradient boosting on decision trees with Catboost
Before talking about gradient boosting I will start with decision trees. A tree as a data structure has many analogies in real life...

How Do Gradient Boosting Algorithms Handle Categorical ...
The Limitations of One-Hot Encoding ... to solve a common issue that arises when using such a target encoding, which is target leakage...

A Gentle Introduction to the Gradient Boosting Algorithm for ...
How gradient boosting works including the loss function, weak learners and ... Kick-start your project with my new book XGBoost With Python, ...
