Classifiers may not work with arrays defining __array_function__

Description

With NEP-18, NumPy functions that previously converted an array-like to an ndarray may no longer perform that implicit conversion. dask.array recently implemented __array_function__, so np.unique(dask.array.Array) now returns a dask.array.Array.

Some more details in https://github.com/dask/dask-ml/issues/541
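
To make the dispatch behaviour concrete, here is a minimal sketch assuming NumPy>=1.17. LazyArray is a hypothetical stand-in written for this illustration, not dask's implementation; it only shows that a class defining __array_function__ gets to decide what np.unique returns.

```python
import numpy as np


class LazyArray:
    """Hypothetical duck array (not dask's code) that opts in to NEP-18."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy hands the call to us instead of converting us to an ndarray.
        # Unwrap LazyArray arguments, run the real function, re-wrap the result.
        unwrapped = [a._data if isinstance(a, LazyArray) else a for a in args]
        return LazyArray(func(*unwrapped, **kwargs))


plain = np.unique(np.array([0, 1, 1, 0, 2]))
duck = np.unique(LazyArray([0, 1, 1, 0, 2]))

print(type(plain).__name__)  # ndarray
print(type(duck).__name__)   # LazyArray
```

Code that assumes np.unique always returns an ndarray will therefore see the duck type instead.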

Steps/Code to Reproduce

import dask.array as da
import dask_ml.datasets
import sklearn.linear_model

X, y = dask_ml.datasets.make_classification(chunks=50)

clf = sklearn.linear_model.LogisticRegression()
clf.fit(X, y)

Expected Results

No error, and the same output as clf.fit(X.compute(), y.compute()) or as running with the environment variable NUMPY_EXPERIMENTAL_ARRAY_FUNCTION='0' set.

Actual Results

Instead, clf.fit(X, y) raises:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-b0953fbb1d6e> in <module>
----> 1 clf.fit(X, y)

~/Envs/dask-dev/lib/python3.7/site-packages/sklearn/linear_model/logistic.py in fit(self, X, y, sample_weight)
   1536
   1537         multi_class = _check_multi_class(self.multi_class, solver,
-> 1538                                          len(self.classes_))
   1539
   1540         if solver == 'liblinear':

TypeError: 'float' object cannot be interpreted as an integer

This is because self.classes_ = np.unique(y) is a Dask Array with unknown length:

In [2]: np.unique(da.arange(12))
Out[2]: dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,), chunktype=numpy.ndarray>

since Dask is lazy and doesn’t know the unique elements until compute time.
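
The error message itself can be reproduced without Dask. The sketch below uses a hypothetical UnknownLengthArray class whose __len__ returns the NaN stored in its shape; it is an assumption about the mechanism, not a copy of Dask's code.

```python
import math


class UnknownLengthArray:
    """Hypothetical stand-in for a lazy array whose size is not known yet."""

    shape = (math.nan,)  # mirrors shape=(nan,) in the repr above

    def __len__(self):
        # Returning the float NaN is what trips up len().
        return self.shape[0]


message = None
try:
    len(UnknownLengthArray())
except TypeError as exc:
    message = str(exc)

print(message)  # 'float' object cannot be interpreted as an integer
```

len() requires an integer result, so a NaN length surfaces as a TypeError rather than as a message about unknown sizes.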

Versions

System:
    python: 3.7.3 (default, Apr  5 2019, 14:56:38)  [Clang 10.0.1 (clang-1001.0.46.3)]
executable: /Users/taugspurger/Envs/dask-dev/bin/python
   machine: Darwin-18.6.0-x86_64-i386-64bit

Python deps:
       pip: 19.2.1
setuptools: 41.0.1
   sklearn: 0.21.3
     numpy: 1.18.0.dev0+5e7e74b
     scipy: 1.2.0
    Cython: 0.29.9
    pandas: 0.25.0+169.g5de4e55d6

I think this needs NumPy>=1.17 and Dask>=2.0.0.


Possible solution: explicitly convert array-likes to concrete ndarrays where necessary (though it is a bit hard to determine where that is). For example, https://github.com/scikit-learn/scikit-learn/blob/148491867920cc2af0e7e5700a0299be4a5d1c9f/sklearn/linear_model/logistic.py#L1517 would become self.classes_ = np.asarray(np.unique(y)). That may not be ideal for other libraries implementing __array_function__ (like pydata/sparse).
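
A rough sketch of that conversion, assuming NumPy>=1.17. Both LazyArray and unique_classes are hypothetical names made up for this example; LazyArray stands in for any duck array that supports explicit conversion through __array__.

```python
import numpy as np


class LazyArray:
    """Hypothetical duck array: NEP-18 dispatch plus explicit conversion."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        # Called by np.asarray(); this is the explicit "compute" step.
        return np.asarray(self._data, dtype=dtype)

    def __array_function__(self, func, types, args, kwargs):
        unwrapped = [a._data if isinstance(a, LazyArray) else a for a in args]
        return LazyArray(func(*unwrapped, **kwargs))


def unique_classes(y):
    """Hypothetical helper: always return class labels as a concrete ndarray."""
    return np.asarray(np.unique(y))


classes = unique_classes(LazyArray([0, 1, 1, 0, 2]))
print(type(classes).__name__, len(classes))  # ndarray 3
```

The outer np.asarray is what forces a concrete result, which is also why it may be a poor fit for duck arrays that should not be converted implicitly.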

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 18 (18 by maintainers)

Top GitHub Comments

1 reaction
shoyer commented, Aug 20, 2019

NumPy itself doesn’t really impose any restrictions on what you can do with __array_function__. I think it would be perfectly reasonable to error when length is NaN.

I would definitely coerce everything to NumPy arrays in check_array. As a first pass that’s definitely the right thing to do. I’m a little surprised that wasn’t happening already. Duck array support is something you want to add intentionally, not accidentally.
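
One way to read the "error when length is NaN" suggestion is sketched below. checked_len is a hypothetical helper invented for this example, not a scikit-learn function, and it assumes the array exposes its size through a shape attribute.

```python
import math


def checked_len(array_like):
    """Hypothetical helper: len() with a clear error for unknown sizes."""
    shape = getattr(array_like, "shape", ())
    if any(isinstance(dim, float) and math.isnan(dim) for dim in shape):
        raise ValueError(
            "array has unknown length (NaN in shape); "
            "convert it to a concrete array first"
        )
    return len(array_like)


class UnknownLength:
    shape = (math.nan,)  # hypothetical lazy array with an unknown size


print(checked_len([0, 1, 2]))  # 3
try:
    checked_len(UnknownLength())
except ValueError as exc:
    print(exc)
```

Failing early like this replaces the confusing TypeError with a message that names the actual problem.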

0 reactions
TomAugspurger commented, Apr 14, 2022

Thanks! I’ll try to take a look at https://github.com/dask/dask-ml/pull/910 soon.
