check_array(X, dtype='numeric') should fail if X has strings

Currently, dtype='numeric' is defined as “dtype is preserved unless array.dtype is object”. This seems overly lenient and strange behaviour, as in #9342, where @qinhanmin2014 shows that check_array(['a', 'b', 'c'], dtype='numeric') works without error and produces an array of strings! This behaviour is not tested, and it’s hard to believe that it is useful and intended. Perhaps we need a deprecation cycle, but I think dtype='numeric' should raise an error, or attempt to coerce, if the data does not actually have a numeric, real-valued dtype.
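To make the proposal concrete, here is a minimal sketch of the stricter rule using only NumPy. The helper name `require_numeric` is hypothetical and is not part of scikit-learn; it only illustrates "raise an error, or attempt to coerce" by inspecting `dtype.kind`, and it assumes that bool, integer, and float dtypes count as numeric.

```python
import numpy as np


def require_numeric(X):
    """Hypothetical helper (not scikit-learn code): reject non-numeric input."""
    arr = np.asarray(X)
    if arr.dtype.kind == "O":
        # Object arrays: attempt to coerce to float. astype raises if any
        # element cannot be converted.
        return arr.astype(np.float64)
    if arr.dtype.kind not in "biuf":
        # Only bool, signed int, unsigned int, and float pass. Strings ('U'),
        # bytes ('S'), and complex ('c') are rejected.
        raise ValueError(f"expected a numeric, real-valued dtype, got {arr.dtype}")
    return arr
```

With this rule, `require_numeric(['a', 'b', 'c'])` raises a `ValueError` instead of silently returning an array of strings, while `require_numeric([1, 2, 3])` returns the integer array unchanged.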

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 14 (11 by maintainers)

Top GitHub Comments

zeromh commented, Sep 28, 2020 (1 reaction)

I’m confused about the fix here. I’m using sklearn 0.23.2, and the behavior that @jnothman called out as a problem is still the same as he described. To reproduce:

import numpy as np
from sklearn.utils import check_array

arr = np.array([[1, 's'],
                [1, 1]])
check_array(arr, dtype='numeric')

FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers
 if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in 
scikit-learn, for example by using your_array = your_array.astype(np.float64).
  return f(**kwargs)

array([['1', 's'],
       ['1', '1']], dtype='<U21')

Now everything’s a string. It looks like the warning message was added in 2018 but the behavior was never changed. Am I missing something?
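The string conversion in this reproduction happens in NumPy, before `check_array` sees the data. The sketch below uses only NumPy to show that, along with the explicit `astype(np.float64)` conversion the warning message recommends. The variable names are illustrative.

```python
import numpy as np

# Mixing an int and a str in one array makes NumPy choose a string dtype,
# so the integers become strings as soon as the array is created.
arr = np.array([[1, "s"],
                [1, 1]])
kind = arr.dtype.kind  # 'U' means a unicode string dtype

# Strings that represent numbers can be converted explicitly.
numeric_strings = np.array([["1", "2"], ["3", "4"]])
as_float = numeric_strings.astype(np.float64)

# The array from the report cannot be converted, because 's' is not a number.
try:
    arr.astype(np.float64)
    conversion_failed = False
except ValueError:
    conversion_failed = True
```

Converting explicitly before calling scikit-learn turns the non-numeric entry into an immediate error rather than a silently returned array of strings.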

jnothman commented, Jan 17, 2018 (1 reaction)

We wouldn’t deprecate check_array entirely, but we would warn for two releases that “In the future, this data with dtype('Uxx') would be rejected because it is not of a numeric dtype.”
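As a rough illustration of the warning phase of such a deprecation cycle, the sketch below emits a `FutureWarning` for string or bytes data and still returns the array unchanged. The function name `check_numeric_deprecating` is hypothetical, and this is not scikit-learn's implementation.

```python
import warnings

import numpy as np


def check_numeric_deprecating(X):
    """Hypothetical sketch of the warning phase; not scikit-learn code."""
    arr = np.asarray(X)
    if arr.dtype.kind in "US":
        # Behaviour is unchanged for now; callers are only told that this
        # input would be rejected in a later release.
        warnings.warn(
            f"In the future, this data with dtype('{arr.dtype}') would be "
            "rejected because it is not of a numeric dtype.",
            FutureWarning,
            stacklevel=2,
        )
    return arr
```

After the warning period, the `warnings.warn` call would be replaced by raising an error.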

Top Results From Across the Web

sklearn.utils.check_array — scikit-learn 1.2.0 documentation
The data name used to construct the error message. In particular if input_name is “X” and the data has NaN values and allow_nan...
check_array() got an unexpected keyword argument ...
so I'm trying to do Co-clustering Mod for my data, here is the code: ... 92 ---> 93 check_array(X, accept_sparse=True, dtype="numeric", ...
Source code for econml.sklearn_extensions.linear_model
This is necessary for their get_params to play nicely with some other ... Will be cast to X's dtype if necessary sample_weight :...
Python sklearn.utils.check_array() Examples
X = check_array(X, accept_sparse=["csr", "csc"]) if self.metric == "precomputed": check_is_fitted(self, "medoid_indices_") return X[:, self.medoid_indices_] ...
sklearn.utils.check_array() - Scikit-learn - W3cubDocs
New in version 0.20: force_all_finite accepts the string 'allow-nan' . ensure_2d : boolean (default=True). Whether to raise a value error if X is...