
data leak in GBDT due to warm start

See original GitHub issue

(This is about the non-histogram-based version of GBDTs)

X is split into train and validation data with train_test_split(random_state=self.random_state).

As @johannfaouzi noted, in a warm-starting context this will produce a leak if self.random_state is a RandomState instance: some samples that were used for training in a previous fit might be used for validation now.

I think the right fix would be to raise a ValueError if the provided random state isn't a number and early stopping is activated.
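The leak described above can be reproduced directly with train_test_split. A minimal sketch: with an integer seed every call yields the same split, but a RandomState instance is consumed in place, so a warm-started refit draws a different validation set and may validate on samples it already trained on.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)

# With an integer seed, every call reproduces the same split.
_, val_a = train_test_split(X, random_state=0)
_, val_b = train_test_split(X, random_state=0)
assert np.array_equal(val_a, val_b)

# With a RandomState instance, each call advances the stream, so a
# second fit (warm start) gets a different validation set: samples
# trained on in the first fit can land in the new validation fold.
rng = np.random.RandomState(0)
_, val_1 = train_test_split(X, random_state=rng)
_, val_2 = train_test_split(X, random_state=rng)
print(np.array_equal(val_1, val_2))  # False (almost surely)
```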

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
ogrisel commented, Sep 13, 2019

+1 for storing a seed as a fit attribute and reusing that to seed an rng in fit only when warm_start=True.

AFAIK, np.random.RandomState accepts uint32 seeds only (between 0 and 2**32 - 1). So the correct way to get a seed from an existing random state object is:

self.random_seed_ = check_random_state(self.random_state).randint(np.iinfo(np.uint32).max)

0 reactions
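Wrapped in an estimator, the pattern @ogrisel suggests looks roughly like the sketch below. WarmStartSplitter is a hypothetical toy class, not scikit-learn code: it freezes a uint32 seed on the first fit and reuses it on warm-started refits, so the train/validation split stays stable even when random_state is a RandomState instance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import check_random_state

class WarmStartSplitter:
    """Hypothetical sketch: keep the train/validation split stable
    across warm-started fits by freezing a seed on the first fit."""

    def __init__(self, random_state=None, warm_start=False):
        self.random_state = random_state
        self.warm_start = warm_start

    def fit(self, X):
        # Draw a uint32 seed once; reuse it on warm-started refits so
        # the same samples end up in the validation set every time.
        if not (self.warm_start and hasattr(self, "random_seed_")):
            rng = check_random_state(self.random_state)
            self.random_seed_ = rng.randint(np.iinfo(np.uint32).max)
        self.X_train_, self.X_val_ = train_test_split(
            X, random_state=self.random_seed_)
        return self
```

Fitting twice with warm_start=True and a RandomState instance now yields identical validation sets, closing the leak sketched above.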
johannfaouzi commented, Sep 19, 2019

Hmm this issue was about the non-histogram-based version of GBDTs, so #14999 did not do anything about this. I intentionally did not use keywords so that this issue would not be closed.

However, I’ve just started working on a fix for this issue 😃

Read more comments on GitHub >

Top Results From Across the Web

Privacy-Preserving Gradient Boosting Decision Trees - arXiv
Existing solutions for GBDT with differential privacy suffer from significant accuracy loss due to too-loose sensitivity bounds and ineffective ...

Implementing Gradient Boosting Regression in Python
In this article we'll start with an introduction to gradient boosting for regression problems, what makes it so advantageous, and its different parameters...

Introduction to gradient boosting on decision trees with Catboost
Before talking about gradient boosting I will start with decision trees. A tree as a data structure has many analogies in real life...

How Do Gradient Boosting Algorithms Handle Categorical ...
The Limitations of One-Hot Encoding ... to solve a common issue that arises when using such a target encoding, which is target leakage...

A Gentle Introduction to the Gradient Boosting Algorithm for ...
How gradient boosting works including the loss function, weak learners and ... Kick-start your project with my new book XGBoost With Python, ...
