[MAINT] Modularize Tree code and Splitter utility functions

In #20819, developers raised issues with the current tree code.

Part of the problem is modularity and, as a result, the maintainability and upgradability of the code. I propose the following short refactors to the _tree.pyx/pxd and _splitter.pyx/pxd files. This would be the first in a series of PRs to demonstrate that #20819 is fairly straightforward.

Tree class

The Tree class assumes axis-aligned splits. If the parts that set the node values and compute the feature value for a given sample are factored into their own functions, any subclass of Tree can redefine just those two functions and supply a new Splitter to enable a “new” type of Tree.

I propose adding the following two functions to the Tree class and altering _add_node() and _apply_dense() to accommodate these changes:

    cdef int _set_node_values(self, SplitRecord split_node,
            Node *node) nogil except -1:
        """Set node data."""
        node.feature = split_node.feature
        node.threshold = split_node.threshold
        return 1

    cdef DTYPE_t _compute_feature(self, const DTYPE_t[:] X_ndarray,
            Node *node) nogil:
        """Compute feature from a given data matrix, X.

        In axis-aligned trees, this is simply the value in the column of X
        for this specific feature.
        """
        # the value at this node's feature index
        cdef DTYPE_t feature = X_ndarray[node.feature]
        return feature
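To make the intent of the two hooks concrete, here is a minimal pure-Python analogy, not the Cython code from the proposal. The `ObliqueTree` subclass, the `weights` field, and the `goes_left` helper are hypothetical names introduced only for illustration; the sketch assumes the usual convention that a sample goes left when its computed feature value is at most the threshold.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class SplitRecord:
    feature: int = 0
    threshold: float = 0.0
    # Hypothetical: only used by the oblique subclass below.
    weights: List[float] = field(default_factory=list)


@dataclass
class Node:
    feature: int = 0
    threshold: float = 0.0
    weights: List[float] = field(default_factory=list)


class AxisAlignedTree:
    """Stand-in for Tree: the two hooks mirror the proposed functions."""

    def _set_node_values(self, split: SplitRecord, node: Node) -> None:
        node.feature = split.feature
        node.threshold = split.threshold

    def _compute_feature(self, x: Sequence[float], node: Node) -> float:
        # Axis-aligned: read the sample's value at the node's feature index.
        return x[node.feature]

    def goes_left(self, x: Sequence[float], node: Node) -> bool:
        # Shared traversal logic only ever calls the hook.
        return self._compute_feature(x, node) <= node.threshold


class ObliqueTree(AxisAlignedTree):
    """Hypothetical subclass: splits on a weighted sum of features."""

    def _set_node_values(self, split: SplitRecord, node: Node) -> None:
        super()._set_node_values(split, node)
        node.weights = list(split.weights)

    def _compute_feature(self, x: Sequence[float], node: Node) -> float:
        return sum(w * v for w, v in zip(node.weights, x))


x = [1.0, 4.0]
axis_node, oblique_node = Node(), Node()
AxisAlignedTree()._set_node_values(
    SplitRecord(feature=1, threshold=3.0), axis_node)
ObliqueTree()._set_node_values(
    SplitRecord(threshold=3.0, weights=[1.0, 0.5]), oblique_node)

axis_left = AxisAlignedTree().goes_left(x, axis_node)
oblique_left = ObliqueTree().goes_left(x, oblique_node)
```

The traversal code in `goes_left` never changes; only the two overridden hooks differ between the tree types, which is the property the refactor is meant to provide.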

Splitter

Splitter uses functions only defined in the .pyx files. As a result, they are not available via cimport. This poses an issue for #20819 and also for downstream packages that might want to define a new splitter that subclasses Splitter.

Here I propose adding the following functions into the _splitter.pxd file:

    cdef inline void sort(DTYPE_t* Xf, SIZE_t* samples, SIZE_t n) nogil
    cdef inline void swap(DTYPE_t* Xf, SIZE_t* samples, SIZE_t i, SIZE_t j) nogil
    # and the other splitter utility functions.
    ...
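For readers unfamiliar with these helpers, the following pure-Python sketch illustrates the behaviour their signatures suggest: `Xf` holds feature values, `samples` holds the matching sample indices, and both arrays are permuted together. This is an assumption inferred from the signatures, not a copy of the library's implementation, and it uses a simple insertion sort for brevity.

```python
def swap(Xf, samples, i, j):
    """Swap positions i and j in both arrays so they stay aligned."""
    Xf[i], Xf[j] = Xf[j], Xf[i]
    samples[i], samples[j] = samples[j], samples[i]


def sort(Xf, samples, n):
    """Sort the first n feature values in place, permuting samples alike.

    Insertion sort is used here only to keep the sketch short.
    """
    for i in range(1, n):
        j = i
        while j > 0 and Xf[j - 1] > Xf[j]:
            swap(Xf, samples, j - 1, j)
            j -= 1


Xf = [0.7, 0.1, 0.4]
samples = [10, 11, 12]
sort(Xf, samples, len(Xf))
# Xf is now [0.1, 0.4, 0.7] and samples is [11, 12, 10]
```

A custom splitter needs exactly this pairing: after sorting, position `k` in `samples` still identifies the sample whose feature value sits at position `k` in `Xf`.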

Misc Notes

This specifically addresses only issues with dense arrays. A follow-on issue and PR would be necessary for sparse arrays.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
thomasjpfan commented, Mar 11, 2022

I’m working on replacing sort altogether. It requires minor refactoring of the splitter internals and benchmarks to make sure there are no performance regressions.

0 reactions
jjerphan commented, Mar 11, 2022
