Path for Adopting the Array API spec
I have been experimenting with adopting the Array API spec into scikit-learn. The Array API is one way for scikit-learn to run on other hardware such as GPUs.
I have some POCs on my fork for LinearDiscriminantAnalysis and GaussianMixture. Overall, there is a runtime performance benefit when running on CuPy compared to NumPy, as shown in notebooks for LDA (14x improvement) and GMM (7x improvement).
Official acceptance of the Array API in NumPy is tracked as NEP 47.
Proposed User API
Here is the proposed API for dispatching. We require the array to adopt the Array API standard and we have a configuration option to turn on Array API dispatching:
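# Create Array API arrays following the spec (X_np and y_np are existing NumPy arrays)
import cupy.array_api as xp
X_cu = xp.asarray(X_np)
y_cu = xp.asarray(y_np)
# Configure scikit-learn to dispatch
from sklearn import set_config
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis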
set_config(array_api_dispatch=True)
# Dispatches using `array_api`
lda_cu = LinearDiscriminantAnalysis()
lda_cu.fit(X_cu, y_cu)
This way the user can decide between the old behavior of potentially casting to NumPy and the new behavior of using the Array API if available.
Developer Experience
The Array API spec and the NumPy API overlap in many cases, but scikit-learn uses some NumPy API that is not part of the Array API. There are a few ways to bridge this gap while keeping the code base maintainable:
1. Wrap the Array-API namespace object to make it look “more like NumPy”
2. Wrap the NumPy module to make it look “more like Array-API”
3. Helper functions everywhere
Options 1 and 2 are not mutually exclusive. To demonstrate these options, I’ll do a case study on `unique`. The Array API spec does not define a `unique` function, but a `unique_values` function instead.
Wrap the Array-API namespace object to make it look “more like NumPy”
def check_y(y):
    np, _ = get_namespace(y)  # Returns _ArrayAPIWrapper or NumPy
    classes = np.unique(y)
class _ArrayAPIWrapper:
    def unique(self, x):
        return self._array_namespace.unique_values(x)
Existing scikit-learn code does not need to change as much because the Array API “looks like NumPy”.
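To make the wrapper idea concrete, here is a minimal runnable sketch. It is not the code from the POCs: the `__init__` and the `__getattr__` forwarding are assumptions about how such a wrapper could be completed, and `spec_like` is a hypothetical stand-in namespace built from NumPy functions so the example runs without a GPU library.

```python
import numpy as np
from types import SimpleNamespace


class _ArrayAPIWrapper:
    """Make an Array API style namespace answer to a few NumPy-style names."""

    def __init__(self, array_namespace):
        self._array_namespace = array_namespace

    def __getattr__(self, name):
        # Only called when normal lookup fails: forward to the wrapped namespace.
        return getattr(self._array_namespace, name)

    def unique(self, x):
        # NumPy-style name mapped onto the spec-style function.
        return self._array_namespace.unique_values(x)


# Hypothetical stand-in that exposes only spec-style names (demo only).
spec_like = SimpleNamespace(unique_values=np.unique, asarray=np.asarray)

xp = _ArrayAPIWrapper(spec_like)
classes = xp.unique(xp.asarray([2, 0, 2, 1]))
print(classes)
```

Calling code keeps writing `unique`, and only the wrapper knows about `unique_values`.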
Make NumPy object “look more like Array-API”
def check_y(y):
    xp, _ = get_namespace(y)  # Returns Array API namespace or _NumPyApiWrapper
    classes = xp.unique_values(y)
class _NumPyApiWrapper:
    def unique_values(self, x):
        return np.unique(x)
We need to update scikit-learn to use these new functions from the Array API spec.
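A runnable sketch of this second option could look like the following. The `__getattr__` fallback and the extra `concat` method are assumptions added for illustration; only `unique_values` comes from the snippet above.

```python
import numpy as np


class _NumPyApiWrapper:
    """Expose NumPy under a few Array API spec names."""

    def __getattr__(self, name):
        # Anything not overridden here falls through to NumPy itself.
        return getattr(np, name)

    def unique_values(self, x):
        return np.unique(x)

    def concat(self, arrays, axis=0):
        # The spec spells this `concat`; NumPy has long spelled it `concatenate`.
        return np.concatenate(arrays, axis=axis)


xp = _NumPyApiWrapper()
y = xp.concat([xp.asarray([1, 0]), xp.asarray([1, 2])])
classes = xp.unique_values(y)
print(classes)
```

With this direction the estimator code is written once against the spec names, and plain NumPy input goes through the thin wrapper.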
Helper functions everywhere
def check_y(y):
    classes = _unique_values(y)
def _unique_values(x):
    xp, is_array_api = get_namespace(x)
    if is_array_api:
        return xp.unique_values(x)
    return np.unique(x)
We need to update scikit-learn to use these helper functions wherever the APIs diverge. Some notable functions that need a wrapper or helper function include `concat`, `astype`, `asarray`, `unique`, `errstate`, `may_share_memory`, etc.
For my POCs, I mostly went with option 1, where I wrapped the Array API namespace to look like NumPy. (I did wrap NumPy once to get `np.dtype`, which is the same as `array.astype`.)
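All three options rely on a `get_namespace` helper that is not shown here. As a rough sketch only, and not scikit-learn's implementation, it could look something like this. The `dispatch` keyword is a hypothetical stand-in for the `array_api_dispatch` configuration option, and `SpecArray` is a made-up array type used to exercise the code.

```python
import numpy as np


def get_namespace(*arrays, dispatch=True):
    """Return (namespace, is_array_api) for the given arrays (sketch)."""
    if not dispatch:
        # Dispatch is off: always use plain NumPy.
        return np, False
    namespaces = []
    for x in arrays:
        # The spec asks array objects to expose __array_namespace__().
        if hasattr(x, "__array_namespace__"):
            ns = x.__array_namespace__()
            if all(ns is not seen for seen in namespaces):
                namespaces.append(ns)
    if not namespaces:
        return np, False
    if len(namespaces) > 1:
        raise ValueError("inputs come from different array namespaces")
    return namespaces[0], True


class SpecArray:
    """Hypothetical array type that advertises a namespace (demo only)."""

    def __init__(self, namespace):
        self._namespace = namespace

    def __array_namespace__(self, api_version=None):
        return self._namespace


fake_xp = object()  # placeholder for a real Array API namespace
xp, is_array_api = get_namespace(SpecArray(fake_xp))
np_xp, np_is_array_api = get_namespace(np.asarray([1, 2]), dispatch=False)
```

Note that whether a plain `numpy.ndarray` exposes `__array_namespace__` depends on the NumPy version, which is one reason an explicit opt-in switch is useful.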
Other API considerations
- Type promotion is more strict with Array API
import numpy.array_api as xp
X = xp.asarray([1])
y = xp.asarray([1.0])
# fails: the spec does not define promotion between integer and floating-point dtypes
X + y
- No method chaining. (Array API arrays do not have methods on them)
(X.mean(axis=1) > 1.0).any()
# becomes
xp.any(xp.mean(X, axis=1) > 1.0)
- Array API has no concept of memory order (NumPy's `order` parameter for C or Fortran layout)
- Array API does not have integer array indexing with `__getitem__`; the alternative is `take`, which is going into the Array API spec.
- No views into arrays in the Array API spec
- Cannot support Dask or JAX at first because they do not support functions with data-dependent output shapes, such as `unique`.
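Several of these points can be previewed with plain NumPy by writing code in the style the spec expects. The sketch below is only an illustration using ordinary `numpy.ndarray` objects, which are more permissive than spec-compliant arrays. It shows function calls in place of method chaining, `take` in place of integer array indexing, and an explicit cast before mixing integer and floating-point data.

```python
import numpy as np

X = np.asarray([[0.5, 2.5], [0.1, 0.2], [3.0, 1.0]])

# Function calls instead of method chaining.
chained = (X.mean(axis=1) > 1.0).any()
functional = np.any(np.mean(X, axis=1) > 1.0)

# `take` instead of integer array indexing through __getitem__.
rows = np.asarray([2, 0])
indexed = X[rows]
taken = np.take(X, rows, axis=0)

# Cast explicitly rather than relying on int + float promotion.
counts = np.asarray([1, 2, 3])
weights = np.asarray([0.5, 0.5, 0.5])
total = np.asarray(counts, dtype=weights.dtype) + weights
```

Each pair gives the same result in NumPy, so code written in the second form keeps working on NumPy while staying closer to what the spec allows.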
Top GitHub Comments
It raises either a `TypeError` or a `ValueError`, depending on which CuPy array you pass in.
If one passes in an “Array API compatible CuPy array”, a `ValueError` is raised. `check_array` calls `numpy.asarray`, which wraps the array into an object.
If one passes a “normal CuPy array”, `asarray` will fail with a `TypeError` because CuPy does not allow silent copies.
The errors raised are very specific to CuPy’s Array API implementation. Other Array API implementations may result in different errors or silently convert. The Array API spec does not say how `numpy.asarray` should work. NumPy defines `__array__` in their Array API compatible array to make `numpy.asarray` work.
For example, calling `numpy.asarray` on an Array API compatible NumPy array results in a silent conversion to a `numpy.ndarray`.
Note that `numpy.array_api` arrays follow the Array API spec, while `numpy.ndarray` does not. In other words, functions in `numpy.array_api` do not work on `numpy.ndarray`.
In the long run, I think we can get away with not using the flag. If a user wants to opt out of Array API, they can convert their arrays to `numpy.ndarray` and pass them into scikit-learn.
In the short term, I think we need the flag so users can opt into experimental behavior.
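The role of `__array__` described in this comment can be illustrated with a small sketch. It uses two hypothetical container classes rather than CuPy or `numpy.array_api`, so it only shows NumPy's general behavior: an object that defines `__array__` is silently converted by `numpy.asarray`, while an opaque object without it ends up wrapped in an object array.

```python
import numpy as np


class WrappedArray:
    """Hypothetical container that opts into NumPy conversion."""

    def __init__(self, data):
        self._data = np.array(data)

    def __array__(self, dtype=None, copy=None):
        # numpy.asarray calls this hook to obtain an ndarray.
        if dtype is None:
            return self._data
        return self._data.astype(dtype)


class OpaqueArray:
    """Hypothetical container with no conversion hook."""

    def __init__(self, data):
        self._data = list(data)


converted = np.asarray(WrappedArray([1, 2, 3]))  # silent conversion
boxed = np.asarray(OpaqueArray([1, 2, 3]))       # 0-d array of dtype object
print(type(converted).__name__, converted.shape)
print(boxed.dtype, boxed.shape)
```

Which of these paths a given Array API library takes, or whether it raises instead, is up to that library, which is why the errors differ between implementations.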
This makes a lot of sense, and I think it’s feasible. For context: immediately having NumPy support the Array API standard in its main namespace was the initial goal when we started writing NEP 47. There were a few incompatible behaviors in the `ndarray` object that made this hard in the short term (casting-related, for example), so we reluctantly switched to a separate `numpy.array_api` namespace. However, we should now revisit making the main namespace as compatible as possible (new functions like the `unique_*` ones can easily be added, for example). And longer term, the behaviors in the Array API are also preferred for `numpy.ndarray`, and we could over time get to full or almost-full compatibility. For example, one key issue is value-based casting, and there’s now an experimental effort to try and get rid of that. It’ll be painful and take quite a while, but it should be doable.