GradientBoostingRegressor with huber loss sometimes fails with an `IndexError: cannot do a non-empty take from an empty axes.`

If I use the first 63726 lines of my dataset for training, everything works, but if I add one more line to the training set, I get the error. My dataset contains no NaNs. The 63727th line has no obvious differences from the others; furthermore, if I use lines 63000:64000 for training I don't get the error, which suggests that the content of line 63727 isn't directly the problem.

I have tried and failed to make a small reproducible test, so I hope someone can make sense of what is happening here and why.

Versions

Windows-2008ServerR2-6.1.7601-SP1
Python 3.6.3 |Intel Corporation| (default, Oct 17 2017, 23:26:12) [MSC v.1900 64 bit (AMD64)]
NumPy 1.13.3
SciPy 0.19.1
Scikit-Learn 0.19.0

Error message I got when using RandomizedSearchCV:

Sub-process traceback:
---------------------------------------------------------------------------
IndexError                                         Thu Dec  7 19:08:00 2017
PID: 11688        Python 3.6.3: C:\ProgramData\Anaconda3\envs\i3\python.exe
...........................................................................
C:\ProgramData\Anaconda3\envs\i3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function _fit_and_score>, (GradientBoostingRegressor(alpha=0.9, criterion='...le=1.0, verbose=0, warm_start=False), memmap([[ -1.00000000e+00,  -2.00000003e-01,  -1 ....14299998e-02,   1.42734203e+01]], dtype=float32), memmap([ 27.,  35., -78., ..., -19.,  -9.,  -4.], type=float32), {'score': <function MaxWinRate>}, array([    0,     1,     2, ..., 12997, 12998, 12999]), memmap([ 13000,  13001,  13002, ..., 618893, 618894, 618895]), 10, {'learning_rate': 0.28522087352060566, 'max_depth': 16, 'max_features': 0.32058686573551248, 'min_samples_leaf': 1, 'n_estimators': 327}), {'error_score': 'raise', 'fit_params': {}, 'return_n_test_samples': True, 'return_parameters': False, 'return_times': True, 'return_train_score': True})]
    132
    133     def __len__(self):
    134         return self._size
    135

...........................................................................
C:\ProgramData\Anaconda3\envs\i3\lib\site-packages\sklearn\externals\joblib\parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _fit_and_score>
        args = (GradientBoostingRegressor(alpha=0.9, criterion='...le=1.0, verbose=0, warm_start=False), memmap([[ -1.00000000e+00,  -2.00000003e-01,  -1 ....14299998e-02,   1.42734203e+01]], dtype=float32), memmap([ 27.,  35., -78., ..., -19.,  -9.,  -4.], dtype=float32), {'score': <function MaxWinRate>}, array([    0,     1,     2, ..., 12997, 12998, 12999]), memmap([ 13000,  13001,  13002, ..., 618893, 618894, 618895]), 10, {'learning_rate': 0.28522087352060566, 'max_depth': 16, 'max_features': 0.32058686573551248, 'min_samples_leaf': 1, 'n_estimators': 327})
        kwargs = {'error_score': 'raise', 'fit_params': {}, 'return_n_test_samples': True, 'return_parameters': False, 'return_times': True, 'return_train_score': True}
    132
    133     def __len__(self):
    134         return self._size
    135

...........................................................................
C:\ProgramData\Anaconda3\envs\i3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator=GradientBoostingRegressor(alpha=0.9, criterion='...le=1.0, verbose=0, warm_start=False), X=memmap([[ -1.00000000e+00,  -2.00000003e-01, -1....14299998e-02,   1.42734203e+01]], dtype=float32), y=memmap([ 27.,  35., -78., ..., -19.,  -9.,  -4.], dtype=float32), scorer={'score': <function MaxWinRate>}, train=array([    0,     1,     2, ..., 12997, 12998, 12999]), test=memmap([ 13000,  13001,  13002, ..., 618893, 618894, 618895]), verbose=10, parameters={'learning_rate': 0.28522087352060566, 'max_depth': 16, 'max_features': 0.32058686573551248, 'min_samples_leaf': 1, 'n_estimators': 327}, fit_params={}, return_train_score=True, return_parameters=False, return_n_test_samples=True, return_times=True, error_score='raise')
    432
    433     try:
    434         if y_train is None:
    435             estimator.fit(X_train, **fit_params)
    436         else:
--> 437             estimator.fit(X_train, y_train, **fit_params)
        estimator.fit = <bound method BaseGradientBoosting.fit of Gradie...e=1.0, verbose=0, warm_start=False)>
        X_train = memmap([[ -1.00000000e+00,  -2.00000003e-01,  -1....04051296e+09,   1.89901295e+01]], dtype=float32)
        y_train = memmap([ 27.,  35., -78., ...,   9.,  21.,  20.], dtype=float32)
        fit_params = {}
    438
    439     except Exception as e:
    440         # Note fit time as time until error
    441         fit_time = time.time() - start_time

...........................................................................
C:\ProgramData\Anaconda3\envs\i3\lib\site-packages\sklearn\ensemble\gradient_boosting.py in fit(self=GradientBoostingRegressor(alpha=0.9, criterion='...le=1.0, verbose=0, warm_start=False), X=array([[ -1.00000000e+00,  -2.00000003e-01,  -1.....04051296e+09,   1.89901295e+01]], dtype=float32), y=memmap([ 27.,  35., -78., ...,   9.,  21.,  20.], dtype=float32), sample_weight=array([ 1.,  1.,  1., ...,  1.,  1.,  1.], dtype=float32), monitor=None)
   1029                 X_idx_sorted = np.asfortranarray(np.argsort(X, axis=0),
   1030                                                  dtype=np.int32)
   1031
   1032         # fit the boosting stages
   1033         n_stages = self._fit_stages(X, y, y_pred, sample_weight, random_state,
-> 1034                                     begin_at_stage, monitor, X_idx_sorted)
        begin_at_stage = 0
        monitor = None
        X_idx_sorted = array([[    0,  8619,     0, ...,   859,   869, ...[ 6499,  7945,  6499, ..., 11039, 10053,  8487]])
   1035         # change shape of arrays after fit (early-stopping or additional ests)
   1036         if n_stages != self.estimators_.shape[0]:
   1037             self.estimators_ = self.estimators_[:n_stages]
   1038             self.train_score_ = self.train_score_[:n_stages]

...........................................................................
C:\ProgramData\Anaconda3\envs\i3\lib\site-packages\sklearn\ensemble\gradient_boosting.py in _fit_stages(self=GradientBoostingRegressor(alpha=0.9, criterion='...le=1.0, verbose=0, warm_start=False), X=array([[ -1.00000000e+00,  -2.00000003e-01,  -1.....04051296e+09,   1.89901295e+01]], dtype=float32), y=memmap([ 27.,  35., -78., ...,   9.,  21.,  20.], dtype=float32), y_pred=array([[ 26.99980487], [ 34.99961777], ...], [ 21.00002664], [ 19.99974011]]), sample_weight=array([ 1.,  1.,  1., ...,  1.,  1.,  1.], dtype=float32), random_state=<mtrand.RandomState object>, begin_at_stage=0, monitor=None, X_idx_sorted=array([[    0,  8619,     0, ...,   859,   869, ...[ 6499,  7945,  6499, ..., 11039, 10053,  8487]]))
   1084                                       sample_weight[~sample_mask])
   1085
   1086             # fit next stage of trees
   1087             y_pred = self._fit_stage(i, X, y, y_pred, sample_weight,
   1088                                      sample_mask, random_state, X_idx_sorted,
-> 1089                                      X_csc, X_csr)
        X_csc = None
        X_csr = None
   1090
   1091             # track deviance (= loss)
   1092             if do_oob:
   1093                 self.train_score_[i] = loss_(y[sample_mask],

...........................................................................
C:\ProgramData\Anaconda3\envs\i3\lib\site-packages\sklearn\ensemble\gradient_boosting.py in _fit_stage(self=GradientBoostingRegressor(alpha=0.9, criterion='...le=1.0, verbose=0, warm_start=False), i=122, X=array([[ -1.00000000e+00,  -2.00000003e-01,  -1.....04051296e+09,   1.89901295e+01]], dtype=float32), y=memmap([ 27., 35., -78., ...,   9.,  21.,  20.], dtype=float32), y_pred=array([[ 26.99980487], [ 34.99961777], ...], [ 21.00002664], [ 19.99974011]]), sample_weight=array([ 1.,  1.,  1., ...,  1.,  1.,  1.], dtype=float32), sample_mask=array([ True,  True,  True, ...,  True,  True,  True], dtype=bool), random_state=<mtrand.RandomState object>, X_idx_sorted=array([[    0,  8619,     0, ...,   859,   869, ...[ 6499,  7945,  6499, ..., 11039, 10053,  8487]]), X_csc=None, X_csr=None)
    793                                              sample_weight, sample_mask,
    794                                              self.learning_rate, k=k)
    795             else:
    796                 loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
    797                                              sample_weight, sample_mask,
--> 798                                              self.learning_rate, k=k)
        self.learning_rate = 0.28522087352060566
        k = 0
    799
    800             # add tree to ensemble
    801             self.estimators_[i, k] = tree
    802

...........................................................................
C:\ProgramData\Anaconda3\envs\i3\lib\site-packages\sklearn\ensemble\gradient_boosting.py in update_terminal_regions(self=<sklearn.ensemble.gradient_boosting.HuberLossFunction object>, tree=<sklearn.tree._tree.Tree object>, X=array([[ -1.00000000e+00,  -2.00000003e-01,  -1.....04051296e+09,   1.89901295e+01]], dtype=float32), y=memmap([ 27.,  35., -78., ...,   9.,  21.,  20.], dtype=float32), residual=array([  1.95126553e-04,   3.82230297e-04,   9.5...1895189e-06,  -2.66392852e-05,   2.59891505e-04]), y_pred=array([[ 26.99980487], [ 34.99961777], ...], [ 21.00002664], [ 19.99974011]]), sample_weight=array([ 1.,  1.,  1., ...,  1.,  1.,  1.], dtype=float32), sample_mask=array([ True,  True,  True, ...,  True,  True,  True], dtype=bool), learning_rate=0.28522087352060566, k=0)
    244
    245         # update each leaf (= perform line search)
    246         for leaf in np.where(tree.children_left == TREE_LEAF)[0]:
    247             self._update_terminal_region(tree, masked_terminal_regions,
    248                                          leaf, X, y, residual,
--> 249                                          y_pred[:, k], sample_weight)
        y_pred = array([[ 26.99980487], [ 34.99961777], ...], [ 21.00002664], [ 19.99974011]])
        k = 0
        sample_weight = array([ 1.,  1.,  1., ...,  1.,  1.,  1.], dtype=float32)
    250
    251         # update predictions (both in-bag and out-of-bag)
    252         y_pred[:, k] += (learning_rate
    253                          * tree.value[:, 0, 0].take(terminal_regions, axis=0))

...........................................................................
C:\ProgramData\Anaconda3\envs\i3\lib\site-packages\sklearn\ensemble\gradient_boosting.py in _update_terminal_region(self=<sklearn.ensemble.gradient_boosting.HuberLossFunction object>, tree=<sklearn.tree._tree.Tree object>, terminal_regions=array([237, 237,  86, ..., 237, 237, 237], dtype=int64), leaf=260, X=array([[ -1.00000000e+00,  -2.00000003e-01,  -1.....04051296e+09,   1.89901295e+01]], dtype=float32), y=memmap([ 27.,  35., -78., ...,   9.,  21.,  20.], dtype=float32), residual=array([  1.95126553e-04,   3.82230297e-04,   9.5...1895189e-06,  -2.66392852e-05,   2.59891505e-04]), pred=array([ 26.99980487,  34.99961777, -78.03339049,...  9.00000162, 21.00002664,  19.99974011]), sample_weight=array([], dtype=float32))
    385         terminal_region = np.where(terminal_regions == leaf)[0]
    386         sample_weight = sample_weight.take(terminal_region, axis=0)
    387         gamma = self.gamma
    388         diff = (y.take(terminal_region, axis=0)
    389                 - pred.take(terminal_region, axis=0))
--> 390         median = _weighted_percentile(diff, sample_weight, percentile=50)
        median = undefined
        diff = array([], dtype=float64)
        sample_weight = array([], dtype=float32)
    391         diff_minus_median = diff - median
    392         tree.value[leaf, 0] = median + np.mean(
    393             np.sign(diff_minus_median) *
    394             np.minimum(np.abs(diff_minus_median), gamma))

...........................................................................
C:\ProgramData\Anaconda3\envs\i3\lib\site-packages\sklearn\utils\stats.py in _weighted_percentile(array=array([], dtype=float64), sample_weight=array([], dtype=float32), percentile=50)
     17     Compute the weighted ``percentile`` of ``array`` with ``sample_weight``.
     18     """
     19     sorted_idx = np.argsort(array)
     20
     21     # Find index of median prediction for each sample
---> 22     weight_cdf = stable_cumsum(sample_weight[sorted_idx])
        weight_cdf = undefined
        sample_weight = array([], dtype=float32)
        sorted_idx = array([], dtype=int64)
     23     percentile_idx = np.searchsorted(
     24         weight_cdf, (percentile / 100.) * weight_cdf[-1])
     25     return array[sorted_idx[percentile_idx]]

...........................................................................
C:\ProgramData\Anaconda3\envs\i3\lib\site-packages\sklearn\utils\extmath.py in stable_cumsum(arr=array([], dtype=float32), axis=None, rtol=1e-05, atol=1e-08)
    757     if np_version < (1, 9):
    758         return np.cumsum(arr, axis=axis, dtype=np.float64)
    759
    760     out = np.cumsum(arr, axis=axis, dtype=np.float64)
    761     expected = np.sum(arr, axis=axis, dtype=np.float64)
--> 762     if not np.all(np.isclose(out.take(-1, axis=axis), expected, rtol=rtol,
        out.take = <built-in method take of numpy.ndarray object>
        axis = None
        expected = 0.0
        rtol = 1e-05
        atol = 1e-08
    763                              atol=atol, equal_nan=True)):
    764         warnings.warn('cumsum was found to be unstable: '
    765                       'its last element does not correspond to sum',
    766                       RuntimeWarning)

IndexError: cannot do a non-empty take from an empty axes.
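
The last two frames show the failure mechanism: a leaf of the fitted tree contains no samples, so `_update_terminal_region` passes empty `diff` and `sample_weight` arrays to `_weighted_percentile`, and `stable_cumsum` then calls `take(-1)` on an empty cumulative sum. A minimal sketch of just that final step in plain NumPy (variable names are illustrative, not scikit-learn internals):

import numpy as np

# A leaf that received no training samples hands empty arrays down the call chain.
sample_weight = np.array([], dtype=np.float32)
sorted_idx = np.argsort(sample_weight)                       # also empty

# stable_cumsum checks the last element of the cumulative sum with ndarray.take(-1),
# which is the call that fails on an empty array.
weight_cdf = np.cumsum(sample_weight[sorted_idx], dtype=np.float64)
weight_cdf.take(-1)   # IndexError: cannot do a non-empty take from an empty axes.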

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 14 (10 by maintainers)

Top GitHub Comments

1 reaction
jpeg729 commented, Dec 14, 2017

They say that RandomForests and friends don’t need data normalisation, but they forget to mention that you have to be careful with near-infinite values.

This will provoke the error 9 times out of 10.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

data = np.random.randn(100, 100)*1e38
data = np.nan_to_num(data.astype('float32'))
X = data[:, :-1]
y = data[:, -1]
GradientBoostingRegressor(loss="huber").fit(X, y)
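
For context on why this snippet produces troublesome data (an illustration added here, not part of the original comment): `randn(100, 100) * 1e38` occasionally exceeds the largest finite float32 (about 3.4e38), the cast to float32 turns those entries into inf, and np.nan_to_num then maps them to the largest finite float32, leaving values sitting at the very edge of the representable range:

import numpy as np

data64 = np.random.randn(100, 100) * 1e38
data32 = data64.astype('float32')
print(np.isinf(data32).any())        # usually True: a few entries overflow float32

data32 = np.nan_to_num(data32)       # those infs become +/-3.4028235e+38
print(np.abs(data32).max(), np.finfo(np.float32).max)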

As far as I can see, the only way a GradientBoostingRegressor can have an empty leaf is if the DecisionTreeRegressor returns a tree with an empty leaf, and sure enough running the following code often reveals the presence of several empty leaves.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree._tree import TREE_LEAF

data = np.random.randn(100, 100)*1e38
data = np.nan_to_num(data.astype('float32'))
X = data[:, :-1]
y = data[:, -1]
tree = DecisionTreeRegressor().fit(X, y)
terminal_regions = tree.apply(X)
count_empty = 0
for leaf in np.where(tree.tree_.children_left == TREE_LEAF)[0]:
    count_empty += len(np.where(terminal_regions == leaf)[0]) == 0

print(count_empty)

Possible solutions

  • Fix DecisionTreeRegressor so that it never returns a tree with an empty leaf, even if some of its input values are near-infinite.
  • Add a warning to the docs telling people to make sure their dataset contains no near-infinite values (a possible user-side guard is sketched below).
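
A sketch of what such a user-side guard could look like (only an illustration of the second suggestion, not anything scikit-learn provides): clipping the data well inside the float32 range before fitting should avoid the near-infinite values that produce empty leaves in the reproduction above.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

data = np.random.randn(100, 100) * 1e38
data = np.nan_to_num(data.astype('float32'))

# Hypothetical guard: keep every value well away from the float32 limit.
# The divisor is an arbitrary safety margin, not a tuned constant.
SAFE_MAX = np.finfo(np.float32).max / 100.0
data = np.clip(data, -SAFE_MAX, SAFE_MAX)

X = data[:, :-1]
y = data[:, -1]
GradientBoostingRegressor(loss="huber").fit(X, y)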

0 reactions
lesteve commented, Jan 22, 2018

Sure!
