Path for pluggable low-level computational routines
The goal of this issue is to discuss the design and prototype a way to register alternative implementations for core low-level routines in scikit-learn, in particular to benefit from hardware-optimized implementations (e.g. using GPUs efficiently).
Motivation
scikit-learn aims to provide reasonably easy-to-maintain and portable implementations of standard machine learning algorithms. Those implementations are typically written in Python (with the help of NumPy and SciPy) or in Cython when the overhead of the Python interpreter prevents us from efficiently implementing algorithms with (nested) tight loops. This allows us to ship reasonably fast implementations as binary packages (installable with pip/PyPI, conda/conda-forge, conda/anaconda or Linux distros) for a variety of platforms (Linux / macOS / Windows) x (x86_64, i686, arm64, ppc64le) from a single code base with no external runtime dependencies beyond Python, NumPy and SciPy.
Recently, GPU hardware has proven very competitive for many machine learning workloads, either from a pure latency standpoint or from the standpoint of a better computation/energy trade-off (irrespective of raw speed considerations). However, hardware-optimized implementations are typically not portable and mandate additional dependencies.
We therefore propose to design a way for our users to register alternative implementations of low-level computational routines in scikit-learn, provided they have installed the required extension package(s) that match their specific hardware.
Relationship to adopting the Array API spec
This proposal is related and complementary to another effort, namely:
The Array API spec makes it possible for some scikit-learn estimators written using pure NumPy syntax to delegate their computation to alternative Array API compatible libraries such as CuPy.
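To illustrate the delegation pattern described above, here is a minimal sketch of how a "pure NumPy syntax" function can dispatch to whichever library provides the array's namespace. The `get_namespace` helper below is hypothetical (scikit-learn's actual helper may differ); the sketch only assumes the `__array_namespace__` protocol from the Array API standard, with a NumPy fallback for older arrays.

```python
import numpy as np

def get_namespace(x):
    # Hypothetical helper: use the array's Array API namespace when the
    # input provides one (CuPy, NumPy >= 2.0, ...), else fall back to NumPy.
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np

def standardize(x):
    # Written once in "numpy syntax"; the computation runs in whichever
    # library implements the namespace, so GPU arrays stay on the GPU.
    xp = get_namespace(x)
    return (x - xp.mean(x, axis=0)) / xp.std(x, axis=0)
```

With a CuPy input, `xp` would be the CuPy namespace and no host/device transfer would occur.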
However, some algorithms in scikit-learn cannot be efficiently written using NumPy operations only. For instance, the main K-Means loop is written in Cython to process chunks of samples in parallel (using prange and OpenMP): it computes the distances to the centroids and reduces them on the fly to assign each sample to its closest centroid, while preventing unnecessary memory transfers between CPU cache and RAM.
If we want to run this algorithm efficiently on GPU hardware, we would need to dispatch the computation of this low-level function to an alternative implementation that can work on GPUs, either written in C/C++ with GPU-specific supporting runtime libraries and compilers (e.g. OpenCL, NVIDIA CUDA, Intel oneAPI DPC++, AMD ROCm…) or using Python syntax with the GPU support provided by numba, for instance.
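For readers unfamiliar with the fused loop mentioned above, here is a plain-NumPy sketch of the pattern (not the actual Cython code): distances to the centroids are computed one chunk of samples at a time and immediately reduced to an argmin, so the full distance matrix is never materialized.

```python
import numpy as np

def assign_to_closest_centroid(X, centroids, chunk_size=256):
    """Chunked sketch of the fused k-means assignment step.

    Each chunk's distance block is reduced to per-sample argmin labels
    on the fly, mimicking (in pure NumPy, hence less efficiently) what
    the Cython/OpenMP implementation does per thread.
    """
    n_samples = X.shape[0]
    labels = np.empty(n_samples, dtype=np.intp)
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is
    # constant per row and does not affect the argmin, so drop it.
    c_norms = (centroids ** 2).sum(axis=1)
    for start in range(0, n_samples, chunk_size):
        chunk = X[start:start + chunk_size]
        dist = c_norms - 2.0 * chunk @ centroids.T
        labels[start:start + chunk_size] = dist.argmin(axis=1)
    return labels
```

A GPU engine would replace this loop with a kernel doing the same fused distance + argmin reduction on device memory.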
List of candidate routines
- the main k-means loop (pairwise distances between samples and centroids with on-the-fly argmin reduction)
- the core k-nearest neighbors computation loop (pairwise distances with on-the-fly arg-k-min reduction)
- pairwise distances (without arg-k-min reduction)
- pairwise kernels computation (e.g. for use in Nystroem)
- …
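The second item in the list, the fused pairwise-distances + arg-k-min reduction used for k-nearest neighbors, can be sketched in chunked NumPy as follows (illustrative only; scikit-learn's actual implementation lives in Cython):

```python
import numpy as np

def chunked_arg_k_min(X, Y, k, chunk_size=256):
    # For each row of X, keep only the indices of its k closest rows in
    # Y; each chunk of the distance matrix is computed, reduced, and
    # discarded on the fly, never materializing the full matrix.
    out = np.empty((X.shape[0], k), dtype=np.intp)
    y_norms = (Y ** 2).sum(axis=1)
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        # Squared distances up to a per-row constant, enough for ranking.
        dist = y_norms - 2.0 * chunk @ Y.T
        # argpartition finds the k smallest per row in O(n), unsorted...
        idx = np.argpartition(dist, kth=k - 1, axis=1)[:, :k]
        # ...then sort just those k candidates for deterministic output.
        order = np.take_along_axis(dist, idx, axis=1).argsort(axis=1)
        out[start:start + chunk_size] = np.take_along_axis(idx, order, axis=1)
    return out
```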
Explicit registration API design ideas
I started to draft some API ideas in:
Feel free to comment here, or there.
This design is expected to evolve, in particular to make it possible to register both Array API and non-Array API extensions with the same registration API.
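To make the explicit-registration idea concrete, here is a toy sketch of what such a mechanism could look like. All names below (`register_engine`, `use_engine`, `dispatch`) are illustrative only, not the API under discussion in the draft; the point is that the default implementation stays in place unless an extension is explicitly activated.

```python
from contextlib import contextmanager

_ENGINES = {}   # routine name -> {engine name -> callable}
_ACTIVE = {}    # routine name -> currently active engine name

def register_engine(routine, name, impl, activate=False):
    # Extension packages would call this at import time.
    _ENGINES.setdefault(routine, {})[name] = impl
    if activate or routine not in _ACTIVE:
        _ACTIVE[routine] = name

@contextmanager
def use_engine(routine, name):
    # Explicitly and temporarily switch the engine used for one routine.
    previous = _ACTIVE[routine]
    _ACTIVE[routine] = name
    try:
        yield
    finally:
        _ACTIVE[routine] = previous

def dispatch(routine, *args, **kwargs):
    return _ENGINES[routine][_ACTIVE[routine]](*args, **kwargs)

# Usage: activation is opt-in; outside the context manager the
# default engine keeps being used.
register_engine("kmeans_single_iter", "default", lambda x: ("cpu", x))
register_engine("kmeans_single_iter", "gpu", lambda x: ("gpu", x))

print(dispatch("kmeans_single_iter", 1))      # ('cpu', 1)
with use_engine("kmeans_single_iter", "gpu"):
    print(dispatch("kmeans_single_iter", 1))  # ('gpu', 1)
```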
Next steps
- start a proof of concept implementation in a branch + external repo using either numba or C++ for k-means. Edit: this is happening here:
- once the API design has converged on the PoC, write a formal SLEP.
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 3
- Comments: 28 (27 by maintainers)
Top GitHub Comments
From the developers of cuML: we think this is a great proposal to improve users' experience and extend scikit-learn without impacting its ease of use, and we'd love to collaborate and contribute towards making it happen. The main advantages of this approach as we see it, mirroring what @ogrisel says, when compared to just having separate libraries that follow the scikit-learn APIs, are: ensuring a more consistent user experience, reducing the barrier to entry (still using scikit-learn proper with an option, as opposed to a new library), and better discoverability/documentation.
There are quite a few elements where we would like to give our feedback based on the past few years of developing a scikit-learn-like library for GPUs. First, I think the API that would probably need the least maintenance from scikit-learn itself is indeed using `_validate_data`, as @thomasjpfan mentioned. This is due to a number of things:
I think that second point is particularly important to make the effort easily adoptable by future libraries that might use different types of hardware. Today, for cuML for example, that means it's on us to accept NumPy/CPU objects and do the transfers to device and back, which is something we've learned we already had to support due to users' expectations anyway.
That said, the mechanism could be even more powerful if the pipeline machinery in scikit-learn could relax some validations so that memory transfers could be minimized in pipelines like:
Perhaps a mechanism that registers the "preferred" device/format of a computational engine, so that if multiple consecutive algorithms run on the same device, the data doesn't need to be transferred back and forth. One problem we've had to address in cuML is how to minimize data transfers and conversions (for example, row-major to column-major). Generally, computational engines may have preferred memory formats for particular algorithms (e.g. row-major vs column-major), so one thing we might want to think about is a mechanism that allows an engine to keep data in its preferred location and format through several chained calls to that engine. Being able to register this "preference" lets backends take advantage of it if desired, or just default to using NumPy arrays, so it is opt-in, which means it wouldn't complicate engine development unless the engine needs it. It would also keep maintenance on the scikit-learn codebase low, by keeping the bulk of that responsibility (and flexibility) on the engine side.
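The "preference" idea above can be sketched in a few lines. Everything here is hypothetical (none of these names exist in scikit-learn or cuML): an engine declares its preferred device and memory layout, and a pipeline would only insert a transfer/conversion where two consecutive steps disagree.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EnginePreference:
    # Hypothetical declaration an engine could register with a pipeline.
    device: str = "cpu"   # e.g. "cpu" or "gpu"
    layout: str = "C"     # row-major "C" vs column-major "F"

def count_transfers(step_prefs):
    # A pipeline would walk consecutive steps and only convert data
    # where preferences differ, leaving same-device chains untouched.
    return sum(prev != nxt for prev, nxt in zip(step_prefs, step_prefs[1:]))

gpu_f = EnginePreference("gpu", "F")
# Three consecutive GPU steps: no transfer between them.
print(count_transfers([gpu_f, gpu_f, gpu_f]))                            # 0
# A GPU step sandwiched between CPU steps: two transfers.
print(count_transfers([EnginePreference(), gpu_f, EnginePreference()]))  # 2
```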
For information: we had a quick chat this afternoon with @betatim and @fcharras where we discussed the current design proposed in #24497 and the choice to make engine activation explicit and manual (at least for now), rather than dependent on the type of the input container.
This is not 100% clear to me yet either. Any current API choice is subject to change as we start implementing engines and get practical experience with their usability when we try to use them for "real-life"-ish data science tasks.
We plan to organize a public online meeting dedicated to the topic on the scikit-learn Discord server in the coming weeks for those interested. We will announce it on the mailing list and maybe Twitter.