Path for Adopting the Array API spec

I have been experimenting with adopting the Array API spec into scikit-learn. The Array API is one way for scikit-learn to run on other hardware such as GPUs.

I have some POCs on my fork for LinearDiscriminantAnalysis and GaussianMixture. Overall, there is a runtime performance benefit when running on CuPy compared to NumPy, as shown in notebooks for LDA (14x improvement) and GMM (7x improvement).

Official acceptance of the Array API in numpy is tracked as NEP 47.

Proposed User API

Here is the proposed API for dispatching. We require the array to adopt the Array API standard and we have a configuration option to turn on Array API dispatching:

# Create Array API arrays following spec
import cupy.array_api as xp
X_cu = xp.asarray(X_np)
y_cu = xp.asarray(y_np)

# Configure scikit-learn to dispatch
from sklearn import set_config
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
set_config(array_api_dispatch=True)

# Dispatches using `array_api`
lda_cu = LinearDiscriminantAnalysis()
lda_cu.fit(X_cu, y_cu)

This way the user can decide between the old behavior of potentially casting to NumPy and the new behavior of using the Array API if available.

Developer Experience

The Array API spec and the NumPy API overlap in many cases, but scikit-learn uses some NumPy API that is not in the Array API. There are a few ways to bridge this gap while keeping the code base maintainable:

1. Wrap the Array-API namespace object to make it look “more like NumPy”
2. Wrap the NumPy module to make it look “more like ArrayAPI”
3. Helper functions everywhere

Options 1 and 2 are not mutually exclusive. To demonstrate these options, I’ll do a case study on unique. The Array API spec does not define a unique function; it defines unique_values instead.

Wrap the Array-API namespace object to make it look “more like NumPy”

def check_y(y):
    np, _ = get_namespace(y)  # Returns _ArrayAPIWrapper or NumPy
    classes = np.unique(y)

class _ArrayAPIWrapper:
    def unique(self, x):
        return self._array_namespace.unique_values(x)

Existing scikit-learn code does not need to change as much because the Array API “looks like NumPy”.
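A slightly fuller sketch of this option follows, assuming the wrapper forwards every name it does not define to the wrapped namespace via `__getattr__`. The constructor, the forwarding, the `concatenate` method, and the toy list-based namespace used to exercise the wrapper are all illustrative assumptions rather than the POC code.

```python
from types import SimpleNamespace


class _ArrayAPIWrapper:
    """Make an Array API namespace answer to a few NumPy-style names."""

    def __init__(self, array_namespace):
        self._array_namespace = array_namespace

    def __getattr__(self, name):
        # Only called when normal lookup fails, so names defined on the
        # wrapper (such as `unique`) win over the wrapped namespace.
        return getattr(self._array_namespace, name)

    def unique(self, x):
        return self._array_namespace.unique_values(x)

    def concatenate(self, arrays, axis=0):
        return self._array_namespace.concat(arrays, axis=axis)


# Toy namespace operating on Python lists, used only to exercise the wrapper.
toy_namespace = SimpleNamespace(
    unique_values=lambda x: sorted(set(x)),
    concat=lambda arrays, axis=0: [item for a in arrays for item in a],
    sum=sum,
)

np_like = _ArrayAPIWrapper(toy_namespace)
classes = np_like.unique([2, 0, 2, 1])      # NumPy-style name, spec function
merged = np_like.concatenate([[1], [2, 3]])  # NumPy-style name, spec function
total = np_like.sum([1, 2, 3])               # forwarded unchanged
```

Only names that differ between the two APIs need an explicit method; everything else falls through to the namespace.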

Make NumPy object “look more like Array-API”

def check_y(y):
    xp, _ = get_namespace(y)  # Returns Array API namespace or _NumPyApiWrapper
    classes = xp.unique_values(y)

class _NumPyApiWrapper:
    def unique_values(self, x):
        return np.unique(x)

We need to update scikit-learn to use these new functions from the Array API spec.

Helper functions everywhere

def check_y(y):
    classes = _unique_values(y)

def _unique_values(x):
    xp, is_array_api = get_namespace(x)
    if is_array_api:
        return xp.unique_values(x)
    return np.unique(x)

We need to update scikit-learn to use these helper functions wherever the APIs diverge. Some notable functions that need a wrapper or helper function include concat, astype, asarray, unique, errstate, may_share_memory, etc.
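The same helper pattern could be applied to two of the other functions listed, concat and astype. This is a sketch under assumptions: the `get_namespace` defined here is a simplified stand-in that ignores any configuration flag, and the `hasattr` guard is there only because not every NumPy release exposes a function-style `astype`.

```python
import numpy as np


def get_namespace(*arrays):
    # Simplified stand-in: dispatch whenever every input exposes a namespace.
    if arrays and all(hasattr(x, "__array_namespace__") for x in arrays):
        return arrays[0].__array_namespace__(), True
    return np, False


def _concat(arrays, axis=0):
    xp, is_array_api = get_namespace(*arrays)
    if is_array_api:
        return xp.concat(arrays, axis=axis)  # Array API spelling
    return np.concatenate(arrays, axis=axis)  # NumPy spelling


def _astype(x, dtype):
    xp, is_array_api = get_namespace(x)
    if is_array_api and hasattr(xp, "astype"):
        return xp.astype(x, dtype)  # function in the Array API
    return x.astype(dtype)  # method on numpy.ndarray


joined = _concat([np.asarray([1, 2]), np.asarray([3])])
as_float = _astype(np.asarray([1, 2]), np.float32)
```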

For my POCs, I mostly went with option 1, where I wrapped the Array API namespace to look like NumPy. (I did wrap NumPy once to get np.dtype, which is the same as array.astype.)

CC @scikit-learn/core-devs

Other API considerations

Type promotion is more strict with Array API

import numpy.array_api as xp
X = xp.asarray([1])
y = xp.asarray([1.0])

# fails
X + y

No method chaining. (Array API arrays do not have methods on them)

(X.mean(axis=1) > 1.0).any()

# becomes
xp.any(xp.mean(X, axis=1) > 1.0)

Array API has no concept of order
Array API does not have integer indexing with __getitem__; the alternative is take, which is going into the Array API spec.
No views into arrays in the Array API spec
Cannot support Dask or JAX at first because they do not support methods that have data-dependent output shapes, such as unique.
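A few of the considerations above can be written in a style that avoids the problem. The sketch below uses plain NumPy as a stand-in for `xp` so that it runs anywhere; the assumption is that an Array API namespace providing `asarray`, `mean`, `any`, and `take` would be called the same way.

```python
import numpy as np

xp = np  # stand-in namespace for this sketch

# Give both operands the same dtype up front instead of relying on promotion.
X = xp.asarray([[1, 2], [3, 4]], dtype=xp.float64)
y = xp.asarray([1.0, 0.5], dtype=xp.float64)
shifted = X + y

# Function-style calls replace method chaining.
any_large = bool(xp.any(xp.mean(shifted, axis=1) > 1.0))

# `take` replaces integer-array indexing such as X[[1, 0]].
rows = xp.take(X, xp.asarray([1, 0]), axis=0)
```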

Top GitHub Comments

thomasjpfan commented, Feb 1, 2022 (6 reactions)

What is the current behavior when passing a CuPy array to LinearDiscriminantAnalysis in scikit-learn 1.0?

It raises either a TypeError or a ValueError, depending on which CuPy array you pass in.

If one passes in an “Array API compatible CuPy array”, a ValueError is raised. check_array calls numpy.asarray, which wraps the array in an object array:

import cupy.array_api as cu_xp
import numpy

X = cu_xp.asarray([1, 2, 3])

# NumPy wraps X in an object array, so `check_array` raises a ValueError
print(numpy.asarray(X).dtype)
# object

If one passes a “normal cupy array” the asarray will fail with a TypeError because cupy does not allow silent copies:

import cupy
import numpy

X = cupy.asarray([1, 2, 3])

# TypeError: Implicit conversion to a NumPy array is not allowed.
numpy.asarray(X)

The errors raised are very specific to CuPy’s Array API implementation. Other Array API implementations may raise different errors or silently convert. The Array API spec does not say how numpy.asarray should work. NumPy defines __array__ on its Array API compatible array to make numpy.asarray work.

For example, calling numpy.asarray on an Array API compatible NumPy array, results in a silent conversion to a numpy.ndarray:

import numpy.array_api as np_xp
import numpy

X = np_xp.asarray([1, 2, 3])
X_convert = numpy.asarray(X)

print(type(X_convert), X_convert.dtype)
# <class 'numpy.ndarray'> int64

Note that numpy.array_api arrays follow the Array API spec, while numpy.ndarray does not. In other words, functions in numpy.array_api do not work on numpy.ndarray.
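The two outcomes described above, silent conversion versus wrapping in an object array, can be reproduced without a GPU. In this sketch both container classes are hypothetical: one defines `__array__`, the hook numpy.asarray looks for, and the other defines no conversion protocol at all.

```python
import numpy as np


class WithArrayProtocol:
    """Hypothetical container that opts in to `numpy.asarray` conversion."""

    def __init__(self, data):
        self._data = list(data)

    def __array__(self, dtype=None, copy=None):
        return np.array(self._data, dtype=dtype)


class Opaque:
    """Hypothetical container with no conversion protocol."""

    def __init__(self, data):
        self._data = list(data)


converted = np.asarray(WithArrayProtocol([1, 2, 3]))  # regular ndarray
wrapped = np.asarray(Opaque([1, 2, 3]))  # 0-d array holding the object

print(type(converted), converted.shape)
print(wrapped.dtype, wrapped.shape)
```

The second case mirrors the object dtype seen with the Array API compatible CuPy array: NumPy has no way to read the data, so it stores the object itself.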

Do you think that in the long run, we could get away without the config flag “set_config(array_api_dispatch=True)” ?

In the long run, I think we can get away without the flag. If a user wants to opt out of Array API, they can convert their arrays to numpy.ndarray and pass it into scikit-learn.

In the short term, I think we need the flag so users can opt into experimental behavior.

rgommers commented, Feb 14, 2022 (5 reactions)

My point was indeed that the current status of the ecosystem puts us in an uncomfortable situation. A more comfortable one would be if a compromise could be reached that would enable numpy to implement the array API, so that we could code for the array API.

I personally agree that it would be great if the main NumPy namespace eventually converged to the array API, at least in the places where it wouldn’t require major compatibility breaks.

Yes, if the NumPy namespace can adopt all the Array API functions, it will become easier for scikit-learn to adopt the spec.

This makes a lot of sense, and I think it’s feasible. For context: immediately having NumPy support the array API standard in its main namespace was the initial goal when we started writing NEP 47. There were a few incompatible behaviors in the ndarray object that made this hard in the short term (casting related, for example), so we reluctantly switched to a separate numpy.array_api namespace. However, we should now revisit making the main namespace as compatible as possible (new functions like the unique_* ones can be easily added, for example). And longer-term, the behaviors in the array API are preferred also for numpy.ndarray, and we could over time get to full or almost-full compatibility. For example, one key issue is value-based casting, and there’s now an experimental effort to try and get rid of that. It’ll be painful and take quite a while, but it should be doable.
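To give a sense of why the unique_* functions are cheap to add, here is a sketch of three of them written as thin layers over numpy.unique. The result tuple names are illustrative assumptions, and this is not NumPy's own implementation.

```python
from collections import namedtuple

import numpy as np

# Illustrative result types; the names are assumptions for this sketch.
UniqueCounts = namedtuple("UniqueCounts", ["values", "counts"])
UniqueInverse = namedtuple("UniqueInverse", ["values", "inverse_indices"])


def unique_values(x):
    return np.unique(x)


def unique_counts(x):
    values, counts = np.unique(x, return_counts=True)
    return UniqueCounts(values, counts)


def unique_inverse(x):
    x = np.asarray(x)
    values, inverse = np.unique(x, return_inverse=True)
    # Reshape so the indices always have the same shape as the input.
    return UniqueInverse(values, inverse.reshape(x.shape))


x = np.asarray([3, 1, 3, 2])
values = unique_values(x)
counted = unique_counts(x)
inverted = unique_inverse(x)
```

Each function returns a fixed set of outputs, unlike numpy.unique, whose return type depends on which `return_*` flags are passed.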
