check_array forces finite values for non-numeric data

Description

When check_array is given dtype=None, it still tries to force finite values.

Steps/Code to Reproduce

import numpy  as np
import pandas as pd
from sklearn.utils import check_array

x = pd.DataFrame({
    'val': [
        np.zeros((1, 2))
    ]
}).values

check_array(x, dtype=None)
check_array(x, dtype=object) # This fails too.

Expected Results

Either no error is thrown and values are not checked for finiteness, or no error is thrown and values are checked for finiteness.

Actual Results

 File "error.py", line 14, in <module>
    check_X_y(x, y, dtype=None)#, force_all_finite=False)
  File ".../sklearn/utils/validation.py", line 719, in check_X_y
    estimator=estimator)
  File ".../sklearn/utils/validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File ".../sklearn/utils/validation.py", line 59, in _assert_all_finite
    if _object_dtype_isnan(X).any():
AttributeError: 'bool' object has no attribute 'any'
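
The last frame suggests that the comparison inside `_object_dtype_isnan` returned a plain Python `bool` rather than an array, so the `.any()` call that follows had nothing to dispatch to. The sketch below reproduces that failure mode in pure Python and shows one defensive way to reduce such a result. `FakeMask` and `any_true` are hypothetical names used only for illustration; this is not scikit-learn code and not necessarily how the library handles it.

```python
class FakeMask:
    """Hypothetical stand-in for an array-like boolean mask exposing .any()."""

    def __init__(self, flags):
        self.flags = list(flags)

    def any(self):
        return any(self.flags)


def any_true(result):
    """Reduce a comparison result that may be a bare bool or a mask object."""
    if isinstance(result, bool):
        # A scalar comparison result: there is nothing to reduce.
        return result
    return bool(result.any())


# A bare bool has no .any() method, which matches the error in the traceback.
try:
    True.any()
except AttributeError as exc:
    print(exc)

print(any_true(False))                    # scalar path
print(any_true(FakeMask([False, True])))  # mask path
```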

Versions

System:
    python: 3.7.3 (default, Mar 27 2019, 22:11:17)  [GCC 7.3.0]
executable: /opt/anaconda3/bin/python
   machine: Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-centos-7.6.1810-Core

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: /opt/anaconda3/lib
cblas_libs: mkl_rt, pthread

Python deps:
       pip: 19.1.1
setuptools: 41.0.1
   sklearn: 0.21.3
     numpy: 1.16.4
     scipy: 1.3.0
    Cython: 0.29.12
    pandas: 0.24.2
Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-centos-7.6.1810-Core
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
NumPy 1.16.4
SciPy 1.3.0
Scikit-Learn 0.21.2

Issue Analytics

Created 4 years ago
Comments: 8 (6 by maintainers)

Top GitHub Comments

glemaitre commented, Sep 23, 2019

If I take the case reported in imbalanced-learn, I think that it is our implementation which fails. We just need to turn the flag off.

Since our primary use case for object dtype is strings, I think that the current defaults are OK.

rth commented, Sep 23, 2019

check_array(['xxx', 'yyy', np.nan], dtype=object, ensure_2d=False)

That behaves consistently IMO, i.e. raises an error on NaN since force_all_finite=True, and doing that by default in a string array makes sense I think.

The question is whether we need to change the behavior for more complex object dtypes – seems likely.
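
To make the string-array case above concrete, here is a minimal pure-Python sketch of a NaN check over object data. It assumes the check relies on the fact that NaN is the only value not equal to itself. `contains_nan` is a hypothetical helper, not part of scikit-learn, and skipping elements whose comparison is ambiguous is only one possible policy for the more complex object dtypes mentioned above.

```python
def contains_nan(values):
    """Return True if any element of a flat sequence is NaN.

    Uses self-inequality: NaN is the only value for which v != v holds.
    """
    for v in values:
        try:
            if v != v:
                return True
        except (TypeError, ValueError):
            # The comparison is unsupported or ambiguous for this element
            # (assumed policy: skip it rather than fail).
            continue
    return False


print(contains_nan(["xxx", "yyy", float("nan")]))
print(contains_nan(["xxx", "yyy"]))
```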