check_array(X, dtype='numeric') should fail if X has strings

Currently, dtype='numeric' is defined as “dtype is preserved unless array.dtype is object”. This seems overly lenient and strange behaviour, as in #9342 where @qinhanmin2014 shows that check_array(['a', 'b', 'c'], dtype='numeric') works without error and produces an array of strings! This behaviour is not tested and it’s hard to believe that it is useful and intended. Perhaps we need a deprecation cycle, but I think dtype='numeric' should raise an error, or attempt to coerce, if the data does not actually have a numeric, real-valued dtype.
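To make the proposal concrete, here is a minimal sketch of the stricter check described above. It uses only NumPy. The helper name require_numeric is hypothetical and is not part of scikit-learn, and treating bool, integer and float as the "numeric, real-valued" kinds is an assumption made for illustration.

```python
import numpy as np


def require_numeric(X):
    """Hypothetical helper: convert X to an array and reject non-numeric dtypes."""
    arr = np.asarray(X)
    # dtype.kind codes: b=bool, i=signed int, u=unsigned int, f=float.
    # Strings ('U'), bytes ('S'), objects ('O') and complex ('c') are rejected.
    if arr.dtype.kind not in "biuf":
        raise ValueError(
            f"Expected a numeric, real-valued dtype, got {arr.dtype!r}"
        )
    return arr


numeric = require_numeric([1, 2, 3])  # passes through unchanged

try:
    require_numeric(["a", "b", "c"])
    rejected = False
except ValueError:
    rejected = True  # strings are refused instead of silently accepted
```

The alternative mentioned in the issue, attempting to coerce, would replace the raise with a conversion to a float dtype.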

Issue Analytics

Created 6 years ago
Comments: 14 (11 by maintainers)

Top GitHub Comments

1 reaction

zeromh commented, Sep 28, 2020

I’m confused about the fix here. I’m using sklearn 0.23.2, and the behavior that @jnothman called out as a problem is still the same as he described. To reproduce:

import numpy as np
from sklearn.utils import check_array

arr = np.array([[1, 's'],
                [1, 1]])
check_array(arr, dtype='numeric')

FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers
 if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in 
scikit-learn, for example by using your_array = your_array.astype(np.float64).
  return f(**kwargs)

array([['1', 's'],
       ['1', '1']], dtype='<U21')

Now everything’s a string. It looks like the warning message was added in 2018 but the behavior was never changed. Am I missing something?
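Until the behaviour changes, the explicit conversion recommended in the warning text can act as a guard. The sketch below uses only NumPy and shows that astype(np.float64) raises on a non-numeric string rather than passing it through. The variable names are illustrative.

```python
import numpy as np

# Mixing an int and a str makes NumPy store every element as a string.
arr = np.array([[1, "s"],
                [1, 1]])
print(arr.dtype.kind)  # 'U' (unicode string)

# Explicit conversion fails loudly on the non-numeric entry.
try:
    arr.astype(np.float64)
    failed = False
except ValueError as exc:
    failed = True
    print(f"conversion failed: {exc}")

# Strings that do represent numbers convert cleanly.
ok = np.array([["1", "2.5"],
               ["3", "4"]]).astype(np.float64)
```

Running the conversion before handing data to an estimator means bad input surfaces as a ValueError at a known point.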

1 reaction

jnothman commented, Jan 17, 2018

We wouldn’t deprecate check_array entirely, but we would warn for two releases that “In the future, this data with dtype('Uxx') would be rejected because it is not of a numeric dtype.”
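A deprecation cycle of this kind could look roughly like the following sketch. It is not scikit-learn's implementation. The function name asarray_numeric_deprecating is hypothetical, and the message wording is adapted from the comment above.

```python
import warnings

import numpy as np


def asarray_numeric_deprecating(X):
    """Hypothetical helper: accept string data for now, but warn about it."""
    arr = np.asarray(X)
    # 'U' = unicode strings, 'S' = byte strings.
    if arr.dtype.kind in "US":
        warnings.warn(
            f"In the future, this data with dtype({arr.dtype.str!r}) would be "
            "rejected because it is not of a numeric dtype.",
            FutureWarning,
            stacklevel=2,
        )
    return arr


# Usage: string input still works during the cycle, but emits a FutureWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = asarray_numeric_deprecating(["a", "b", "c"])
```

After the warning period, the warnings.warn call would be replaced by raising an error.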
