Improve HGBT leaf values updates

Describe the workflow you want to enable

This issue references this TODO comment:

https://github.com/scikit-learn/scikit-learn/blob/be89eb75f250dc5a769281939ba01e570fb12ae1/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py#L65-L67

Describe your proposed solution

Some APIs (like std::nth_element) can find the median or a given quantile of a contiguous buffer in O(n), but they need to mutate a data structure to partially sort it (either the buffer itself, or another data structure if a Comparator is used).

Could it be used there?

Moreover, why can’t we compute the median in parallel here?
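To make the O(n) selection idea concrete, here is a minimal pure-Python sketch of a quickselect that mirrors what std::nth_element does: it partially reorders the buffer in place so that position k holds the k-th smallest value. The function names `nth_element` and `median_inplace` are hypothetical and are not part of scikit-learn; this is an illustration under those assumptions, not the code used in HGBT.

```python
import random


def nth_element(buf, k):
    """Reorder `buf` in place so that buf[k] is the k-th smallest value.

    Expected O(n) time (quickselect). Like std::nth_element, this mutates
    `buf`: values left of k end up <= buf[k], values right of k >= buf[k].
    """
    lo, hi = 0, len(buf) - 1
    while lo < hi:
        pivot = buf[random.randint(lo, hi)]
        i, j = lo, hi
        # Hoare-style partition of buf[lo..hi] around the pivot value.
        while i <= j:
            while buf[i] < pivot:
                i += 1
            while buf[j] > pivot:
                j -= 1
            if i <= j:
                buf[i], buf[j] = buf[j], buf[i]
                i += 1
                j -= 1
        # Continue only in the side that contains index k.
        if k <= j:
            hi = j
        elif k >= i:
            lo = i
        else:
            break  # j < k < i: buf[k] equals the pivot and is in place
    return buf[k]


def median_inplace(buf):
    """Median of a non-empty list, computed by selection instead of a full sort."""
    n = len(buf)
    mid = n // 2
    upper = nth_element(buf, mid)
    if n % 2 == 1:
        return upper
    # Everything left of `mid` is <= buf[mid], so the lower middle value
    # is simply the maximum of that prefix (one extra O(n) pass).
    lower = max(buf[:mid])
    return 0.5 * (lower + upper)
```

Calling `median_inplace([5.0, 1.0, 4.0, 2.0, 3.0])` returns `3.0`. The input list is reordered as a side effect, which is the mutation the proposal mentions; pass a copy if the original order must be preserved.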

Additional context

This follows up on the discussion at: https://github.com/scikit-learn/scikit-learn/pull/20811/files/4df17828e439ad09a7784e99cc3d8d956eb50fe0#r781036357


/cc @NicolasHug who might be interested in following this issue as he initially worked on the update and authored this comment.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
NicolasHug commented, Jan 13, 2022

We can probably work around this “list of arrays of different sizes” issue by doing something similar to https://github.com/scikit-learn/scikit-learn/blob/5d7dc4ba327c138cf63be5cd9238200037c1eb13/sklearn/ensemble/_hist_gradient_boosting/_gradient_boosting.pyx (in particular the start/stop logic)
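As a rough illustration of the start/stop idea, the sketch below assumes it means storing all sample indices in one flat array and addressing each leaf as a `[start, stop)` slice of it, instead of keeping a list of arrays of different sizes. The helper names `group_by_leaf` and `leaf_medians` are hypothetical, and returning `0.0` for an empty leaf is an arbitrary choice made only for this example; none of this is taken from the linked file.

```python
import statistics


def group_by_leaf(leaf_ids, n_leaves):
    """Group sample indices by leaf in one flat list.

    Returns (order, starts, stops) such that order[starts[l]:stops[l]]
    contains the indices of the samples that fall in leaf `l`.
    """
    counts = [0] * n_leaves
    for leaf in leaf_ids:
        counts[leaf] += 1
    # Prefix sums of the counts give the start offset of each leaf.
    starts = [0] * n_leaves
    for leaf in range(1, n_leaves):
        starts[leaf] = starts[leaf - 1] + counts[leaf - 1]
    stops = [start + count for start, count in zip(starts, counts)]
    # Scatter each sample index into its leaf's segment.
    cursor = list(starts)
    order = [0] * len(leaf_ids)
    for sample_idx, leaf in enumerate(leaf_ids):
        order[cursor[leaf]] = sample_idx
        cursor[leaf] += 1
    return order, starts, stops


def leaf_medians(residuals, leaf_ids, n_leaves):
    """Median residual per leaf, using the flat start/stop layout."""
    order, starts, stops = group_by_leaf(leaf_ids, n_leaves)
    medians = []
    for leaf in range(n_leaves):
        segment = [residuals[i] for i in order[starts[leaf]:stops[leaf]]]
        # Assumption for this sketch: an empty leaf gets 0.0.
        medians.append(statistics.median(segment) if segment else 0.0)
    return medians
```

For example, `leaf_medians([1.0, 10.0, 3.0, 20.0, 2.0], [0, 1, 0, 1, 0], 2)` returns `[2.0, 15.0]`. Because the segments do not overlap, each leaf could in principle be processed independently of the others.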

"... will force the whole HGBT Cython base to be C++."

I’m not sure what you mean by that, could you clarify? Do you mean that the Cython code will be compiled to C++ instead of C?

0 reactions
jjerphan commented, Jan 22, 2022

"@jjerphan Does your profiling indicate other room for improvements?"

We could at least remove the TODO comment hinting at this issue.

I think the best way to tell is to try optimizing it.

From what I recall of my exploration of the OpenMP parallel sections a few months ago (namely GOMP_parallel, which, as shown above, accounts for large portions of the profiling report), it seems that they can’t really be improved much: they run in nogil contexts, and the instructions there are simple and already optimized.
