Path for pluggable low-level computational routines
The goal of this issue is to discuss the design and prototype a way to register alternative implementations for core low-level routines in scikit-learn, in particular to benefit from hardware-optimized implementations (e.g. using GPUs efficiently).
Motivation
scikit-learn aims to provide reasonably easy-to-maintain and portable implementations of standard machine learning algorithms. Those implementations are typically written in Python (with the help of NumPy and SciPy) or in Cython when the overhead of the Python interpreter prevents us from efficiently implementing algorithms with (nested) tight loops. This allows us to ship reasonably fast implementations as binary packages (installable with pip/PyPI, conda/conda-forge, conda/anaconda or Linux distros) for a variety of platforms (Linux / macOS / Windows) x (x86_64, i686, arm64, ppc64le) from a single code base with no external runtime dependencies beyond Python, NumPy and SciPy.
Recently, GPU hardware has proven very competitive for many machine learning workloads, either from a pure latency standpoint or from the standpoint of a better computation/energy trade-off (irrespective of raw speed considerations). However, hardware-optimized implementations are typically not portable and mandate additional dependencies.
We therefore propose to design a way for our users to register alternative implementations of low-level computational routines in scikit-learn, provided they have installed the required extension package(s) that match their specific hardware.
Relationship to adopting the Array API spec
This proposal is related and complementary to another effort, namely:
The Array API spec makes it possible for some scikit-learn estimators written using pure NumPy syntax to delegate their computation to alternative Array API compatible libraries such as CuPy.
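To illustrate the delegation pattern described above, here is a minimal sketch of how a "pure NumPy syntax" function can dispatch to whichever library provides the array's namespace. The `get_namespace` helper below is hypothetical (scikit-learn's actual helper may differ); the sketch only assumes the `__array_namespace__` protocol from the Array API standard, with a NumPy fallback for older arrays.

```python
import numpy as np

def get_namespace(x):
    # Hypothetical helper: use the array's Array API namespace when the
    # input provides one (CuPy, NumPy >= 2.0, ...), else fall back to NumPy.
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np

def standardize(x):
    # Written once in "numpy syntax"; the computation runs in whichever
    # library implements the namespace, so GPU arrays stay on the GPU.
    xp = get_namespace(x)
    return (x - xp.mean(x, axis=0)) / xp.std(x, axis=0)
```

With a CuPy input, `xp` would be the CuPy namespace and no host/device transfer would occur.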
However, some algorithms in scikit-learn cannot be efficiently written using NumPy operations only. For instance, the main K-Means loop is written in Cython to process chunks of samples in parallel (using prange and OpenMP): it computes the distances to the centroids and reduces them on the fly to assign each sample to its closest centroid, while preventing unnecessary memory transfers between CPU cache and RAM.
If we want to run this algorithm efficiently on GPU hardware, we would need to dispatch the computation of this low-level function to an alternative implementation that can work on GPUs, either written in C/C++ with GPU-specific supporting runtime libraries and compilers (e.g. OpenCL, NVIDIA CUDA, Intel oneAPI DPC++, AMD ROCm…) or using Python syntax with the GPU support provided by numba, for instance.
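For readers unfamiliar with the fused loop mentioned above, here is a plain-NumPy sketch of the pattern (not the actual Cython code): distances to the centroids are computed one chunk of samples at a time and immediately reduced to an argmin, so the full distance matrix is never materialized.

```python
import numpy as np

def assign_to_closest_centroid(X, centroids, chunk_size=256):
    """Chunked sketch of the fused k-means assignment step.

    Each chunk's distance block is reduced to per-sample argmin labels
    on the fly, mimicking (in pure NumPy, hence less efficiently) what
    the Cython/OpenMP implementation does per thread.
    """
    n_samples = X.shape[0]
    labels = np.empty(n_samples, dtype=np.intp)
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is
    # constant per row and does not affect the argmin, so drop it.
    c_norms = (centroids ** 2).sum(axis=1)
    for start in range(0, n_samples, chunk_size):
        chunk = X[start:start + chunk_size]
        dist = c_norms - 2.0 * chunk @ centroids.T
        labels[start:start + chunk_size] = dist.argmin(axis=1)
    return labels
```

A GPU engine would replace this loop with a kernel doing the same fused distance + argmin reduction on device memory.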
List of candidate routines
- the main k-means loop (pairwise distances between samples and centroids with on-the-fly argmin reduction)
- the core k-nearest neighbors computation loop (pairwise distances with on-the-fly arg-k-min reduction)
- pairwise distances (without arg-k-min reduction)
- pairwise kernels computation (e.g. for use in Nystroem)
- …
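The second item in the list, the fused pairwise-distances + arg-k-min reduction used for k-nearest neighbors, can be sketched in chunked NumPy as follows (illustrative only; scikit-learn's actual implementation lives in Cython):

```python
import numpy as np

def chunked_arg_k_min(X, Y, k, chunk_size=256):
    # For each row of X, keep only the indices of its k closest rows in
    # Y; each chunk of the distance matrix is computed, reduced, and
    # discarded on the fly, never materializing the full matrix.
    out = np.empty((X.shape[0], k), dtype=np.intp)
    y_norms = (Y ** 2).sum(axis=1)
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        # Squared distances up to a per-row constant, enough for ranking.
        dist = y_norms - 2.0 * chunk @ Y.T
        # argpartition finds the k smallest per row in O(n), unsorted...
        idx = np.argpartition(dist, kth=k - 1, axis=1)[:, :k]
        # ...then sort just those k candidates for deterministic output.
        order = np.take_along_axis(dist, idx, axis=1).argsort(axis=1)
        out[start:start + chunk_size] = np.take_along_axis(idx, order, axis=1)
    return out
```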
Explicit registration API design ideas
I started to draft some API ideas in:
Feel free to comment here, or there.
This design is expected to evolve, in particular to make it possible to register both Array API and non-Array API extensions with the same registration API.
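To make the explicit-registration idea concrete, here is a toy sketch of what such a mechanism could look like. All names below (`register_engine`, `use_engine`, `dispatch`) are illustrative only, not the API under discussion in the draft; the point is that the default implementation stays in place unless an extension is explicitly activated.

```python
from contextlib import contextmanager

_ENGINES = {}   # routine name -> {engine name -> callable}
_ACTIVE = {}    # routine name -> currently active engine name

def register_engine(routine, name, impl, activate=False):
    # Extension packages would call this at import time.
    _ENGINES.setdefault(routine, {})[name] = impl
    if activate or routine not in _ACTIVE:
        _ACTIVE[routine] = name

@contextmanager
def use_engine(routine, name):
    # Explicitly and temporarily switch the engine used for one routine.
    previous = _ACTIVE[routine]
    _ACTIVE[routine] = name
    try:
        yield
    finally:
        _ACTIVE[routine] = previous

def dispatch(routine, *args, **kwargs):
    return _ENGINES[routine][_ACTIVE[routine]](*args, **kwargs)

# Usage: activation is opt-in; outside the context manager the
# default engine keeps being used.
register_engine("kmeans_single_iter", "default", lambda x: ("cpu", x))
register_engine("kmeans_single_iter", "gpu", lambda x: ("gpu", x))

print(dispatch("kmeans_single_iter", 1))      # ('cpu', 1)
with use_engine("kmeans_single_iter", "gpu"):
    print(dispatch("kmeans_single_iter", 1))  # ('gpu', 1)
```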
Next steps
- start a proof of concept implementation in a branch + external repo using either numba or C++ for k-means. Edit: this is happening here:
- once the API design has converged on the PoC, write a formal SLEP.
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 3
- Comments: 28 (27 by maintainers)
Top GitHub Comments
From the developers of cuML: we think this is a great proposal to improve users' experience and extend scikit-learn without impacting its ease of use, and we'd love to collaborate and contribute towards making it happen. The main advantages of this approach as we see it, mirroring what @ogrisel says, when compared to just having separate libraries that follow the scikit-learn APIs, are: ensuring a more consistent user experience, reducing the barrier to entry (still using scikit-learn proper with an option, as opposed to a new library), and better discoverability/documentation.
There are quite a few elements where we would like to give our feedback based on the past few years of developing a scikit-learn-like library for GPUs. First, I think the API that would probably need the least maintenance from scikit-learn itself is indeed using `_validate_data`, as @thomasjpfan mentioned. This is due to a number of things:
I think that second point is particularly important to make the effort easily adoptable by future libraries that might use different types of hardware. Today, for cuML for example, that means it's on us to accept NumPy/CPU objects and do the transfers to device and back, which is something we've learned we already had to support due to users' expectations anyway.
That said, the mechanism could be even more powerful if the pipeline machinery in scikit-learn could relax some validations so that memory transfers could be minimized in pipelines like:
Perhaps a mechanism that registers the "preferred" device/format of a computational engine, so that if multiple consecutive algorithms run on the same device, the data doesn't need to be transferred back and forth. One problem we've had to address in cuML is how to minimize data transfers and conversions (for example, row-major to column-major). Generally, computational engines may have preferred memory formats for particular algorithms (e.g. row-major vs column-major), so one thing we might want to think about is a mechanism that allows an engine to keep data in its preferred location and format through several chained calls to that engine. Being able to register this "preference" lets backends take advantage of it if desired, or just default to using NumPy arrays, so it is opt-in, which means it wouldn't complicate engine development unless the engine needs it. It would also keep maintenance on the scikit-learn codebase low, by keeping the bulk of that responsibility (and flexibility) on the engine side.
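The "preference" idea above can be sketched in a few lines. Everything here is hypothetical (none of these names exist in scikit-learn or cuML): an engine declares its preferred device and memory layout, and a pipeline would only insert a transfer/conversion where two consecutive steps disagree.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EnginePreference:
    # Hypothetical declaration an engine could register with a pipeline.
    device: str = "cpu"   # e.g. "cpu" or "gpu"
    layout: str = "C"     # row-major "C" vs column-major "F"

def count_transfers(step_prefs):
    # A pipeline would walk consecutive steps and only convert data
    # where preferences differ, leaving same-device chains untouched.
    return sum(prev != nxt for prev, nxt in zip(step_prefs, step_prefs[1:]))

gpu_f = EnginePreference("gpu", "F")
# Three consecutive GPU steps: no transfer between them.
print(count_transfers([gpu_f, gpu_f, gpu_f]))                            # 0
# A GPU step sandwiched between CPU steps: two transfers.
print(count_transfers([EnginePreference(), gpu_f, EnginePreference()]))  # 2
```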
For information: we had a quick chat this afternoon with @betatim and @fcharras where we discussed the current design proposed in #24497 and the choice to make engine activation explicit and manual (at least for now), rather than dependent on the type of the input container.
This is not 100% clear to me yet either. Any current API choice is subject to change as we start implementing engines and get practical experience with their usability when we try to use them for "real-life"-ish data science tasks.
We plan to organize a public online meeting dedicated to the topic on the scikit-learn Discord server in the coming weeks for those interested. We will announce it on the mailing list and maybe Twitter.