Improve HGBT leaf values updates

Describe the workflow you want to enable

This issue references this TODO comment:

https://github.com/scikit-learn/scikit-learn/blob/be89eb75f250dc5a769281939ba01e570fb12ae1/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py#L65-L67

Describe your proposed solution

Some APIs (like std::nth_element) allow getting the median or a given quantile of a contiguous buffer in O(n) on average, but they need to mutate a data structure to partially sort it (either the buffer itself or another data structure if using a Comparator).

Could it be used there?

Moreover, why can’t we compute the median in parallel here?
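
To make the proposal concrete, here is a minimal C++ sketch of a median computed with std::nth_element. The function name `median` and the choice to take the buffer by value (so that the caller's data is not reordered) are assumptions made for illustration; this is not code from scikit-learn.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a buffer in linear time on average.
// The vector is taken by value: std::nth_element reorders this copy,
// so the caller's buffer is left untouched.
double median(std::vector<double> values) {
    assert(!values.empty());
    const std::size_t n = values.size();
    auto mid = values.begin() + static_cast<std::ptrdiff_t>(n / 2);

    // After this call, *mid is the element that would sit at index n / 2
    // in sorted order, and everything before mid is <= *mid.
    std::nth_element(values.begin(), mid, values.end());
    const double upper = *mid;
    if (n % 2 == 1) {
        return upper;
    }

    // Even length: the lower middle value is the largest element
    // of the left partition.
    const double lower = *std::max_element(values.begin(), mid);
    return 0.5 * (lower + upper);
}
```

Copying the input trades extra memory for leaving the original buffer intact; operating in place would avoid the copy but reorder the caller's data, which is the mutation concern raised above.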

Additional context

Follow-up to this discussion: https://github.com/scikit-learn/scikit-learn/pull/20811/files/4df17828e439ad09a7784e99cc3d8d956eb50fe0#r781036357

/cc @NicolasHug who might be interested in following this issue as he initially worked on the update and authored this comment.

Issue Analytics

State:
Created 2 years ago
Comments: 13 (13 by maintainers)

Top GitHub Comments

NicolasHug commented, Jan 13, 2022

We can probably work around this “list of arrays of different sizes” issue by doing something similar to https://github.com/scikit-learn/scikit-learn/blob/5d7dc4ba327c138cf63be5cd9238200037c1eb13/sklearn/ensemble/_hist_gradient_boosting/_gradient_boosting.pyx (in particular the start/stop logic)
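
As an illustration of the start/stop idea, the sketch below stores the values of all leaves in one flat buffer and uses an offsets array to delimit each leaf's slice, so no list of differently sized arrays is needed. The names `leaf_medians` and `offsets` and the layout itself are hypothetical and are not taken from the linked file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: leaf i owns the half-open slice
// [offsets[i], offsets[i + 1]) of a single contiguous buffer.
// The buffer is taken by value because each slice is reordered in place.
std::vector<double> leaf_medians(std::vector<double> buffer,
                                 const std::vector<std::size_t>& offsets) {
    std::vector<double> medians;
    for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
        auto first = buffer.begin() + static_cast<std::ptrdiff_t>(offsets[i]);
        auto last = buffer.begin() + static_cast<std::ptrdiff_t>(offsets[i + 1]);
        assert(first < last);  // this sketch assumes every leaf is non-empty

        auto mid = first + (last - first) / 2;
        std::nth_element(first, mid, last);
        double m = *mid;
        if ((last - first) % 2 == 0) {
            // Even-sized leaf: average the two middle values.
            m = 0.5 * (m + *std::max_element(first, mid));
        }
        medians.push_back(m);
    }
    return medians;
}
```

Because the slices are disjoint, each loop iteration reads and writes a different region of the buffer, which is what would make it possible in principle to process the leaves independently.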

“will force the whole HGBT Cython base to be C++.”

I’m not sure what you mean by that, could you clarify? Do you mean that the Cython code will be compiled to C++ instead of C?

jjerphan commented, Jan 22, 2022

@jjerphan Does your profiling indicate other room for improvements?

We could at least remove the TODO comment hinting at this issue.

I think the best way to tell is to try optimizing it.

From what I recall of my exploration of the OpenMP parallel sections a few months ago (namely GOMP_parallel, which, as shown above, accounts for large portions of the profiling report), it seems that they can’t really be improved much: they run in a nogil context, and the instructions there are simple and already optimized.
