Improve HGBT leaf values updates
Describe the workflow you want to enable
This issue references this TODO comment:
Describe your proposed solution
Some APIs (like std::nth_element) allow getting the median or a given quantile of a contiguous buffer in O(n), but they need to mutate a data structure to partially sort it (either the buffer itself, or another data structure if using a comparator).
Could it be used there?
Moreover, why can’t we compute the median in parallel here?
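To make the std::nth_element idea concrete, here is a minimal C++ sketch, not the scikit-learn implementation. The function name median_copy is hypothetical, and the input is assumed to be non-empty. It takes the vector by value, so the mutation mentioned above happens on a copy and the caller's buffer is left untouched. The selection runs in linear time on average.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of `values` in O(n) on average.
// The vector is passed by value because std::nth_element reorders its
// input; working on a copy leaves the caller's buffer untouched.
// Assumption: `values` is non-empty.
double median_copy(std::vector<double> values) {
    const std::size_t n = values.size();
    const auto mid = values.begin() + static_cast<std::ptrdiff_t>(n / 2);

    // After this call, *mid is the element that would sit at index n / 2
    // in sorted order, and everything before it is <= *mid.
    std::nth_element(values.begin(), mid, values.end());
    const double upper = *mid;
    if (n % 2 == 1) {
        return upper;
    }

    // Even length: the lower middle is the largest element of the left part.
    const double lower = *std::max_element(values.begin(), mid);
    return 0.5 * (lower + upper);
}
```

If mutating the original buffer is acceptable, the copy can be avoided by taking a reference instead, at the cost of leaving the buffer partially reordered.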
Additional context
Follow-up to this discussion: https://github.com/scikit-learn/scikit-learn/pull/20811/files/4df17828e439ad09a7784e99cc3d8d956eb50fe0#r781036357
/cc @NicolasHug who might be interested in following this issue as he initially worked on the update and authored this comment.
Issue Analytics
- State:
- Created 2 years ago
- Comments:13 (13 by maintainers)
Top GitHub Comments
We can probably work around this “list of arrays of different sizes” issue by doing something similar to https://github.com/scikit-learn/scikit-learn/blob/5d7dc4ba327c138cf63be5cd9238200037c1eb13/sklearn/ensemble/_hist_gradient_boosting/_gradient_boosting.pyx (in particular the start/stop logic).
I’m not sure what you mean by that, could you clarify? Do you mean that the Cython code will be compiled to C++ instead of C?
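As an illustration of the start/stop idea for the “list of arrays of different sizes” issue, here is a hedged C++ sketch. It does not reproduce the linked Cython code: the layout (one flat buffer plus per-leaf start and stop offsets) and the name leaf_medians are assumptions made for the example, and every leaf range is assumed to be non-empty and disjoint from the others.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical layout: the values of all leaves live in one flat buffer,
// and leaf k owns the half-open range [starts[k], stops[k]).
// Assumptions: ranges are non-empty and do not overlap.
// The buffer is taken by value because nth_element reorders each slice.
std::vector<double> leaf_medians(std::vector<double> buffer,
                                 const std::vector<std::size_t>& starts,
                                 const std::vector<std::size_t>& stops) {
    std::vector<double> medians(starts.size());
    for (std::size_t k = 0; k < starts.size(); ++k) {
        const auto first = buffer.begin() + static_cast<std::ptrdiff_t>(starts[k]);
        const auto last = buffer.begin() + static_cast<std::ptrdiff_t>(stops[k]);
        const auto mid = first + (last - first) / 2;

        // Partial sort restricted to this leaf's slice only.
        std::nth_element(first, mid, last);
        double median = *mid;
        if ((last - first) % 2 == 0) {
            // Even count: average the two middle values.
            median = 0.5 * (median + *std::max_element(first, mid));
        }
        medians[k] = median;
    }
    return medians;
}
```

Because each iteration only touches its own slice and its own output slot, the iterations are independent of one another, which is the property a parallel loop over leaves would rely on. Whether that pays off in practice would need to be measured.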
I think the best way to tell is to try optimizing it.
From what I recall of my exploration of the OpenMP parallel sections a few months ago (namely GOMP_parallel, as shown above, which account for large portions of the profiling report), it seems that they can’t really be improved much (they run in a nogil context, and the instructions there are simple and already optimized).