TypeError: Wrong type for parameter `n_values` in OneHotEncoder
Steps/Code to Reproduce
import numpy as np
from sklearn.preprocessing import OneHotEncoder
numerical_features = np.random.randint(10, size=(5,4))
categorical = np.array([2, 2, 3, 2, 3]).reshape(-1,1)
X = np.hstack((numerical_features, categorical))
onehotencoder = OneHotEncoder(categorical_features=[4],
                              handle_unknown='ignore')
X_encoded = onehotencoder.fit_transform(X)
Expected Results
No error should be thrown. OneHotEncoder should behave as in legacy mode and encode only the supplied columns.
Actual Results
/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:390: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
"use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
File "<ipython-input-15-c174bb78e628>", line 1, in <module>
runfile('/home/vivek/untitless.py', wdir='/home/vivek')
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/spyder_kernels/customize/spydercustomize.py", line 668, in runfile
execfile(filename, namespace)
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/spyder_kernels/customize/spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/vivek/untitless.py", line 24, in <module>
X_encoded = onehotencoder.fit_transform(X)
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 514, in fit_transform
self._categorical_features, copy=True)
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/base.py", line 71, in _transform_selected
X_sel = transform(X[:, ind[sel]])
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 456, in _legacy_fit_transform
% type(X))
TypeError: Wrong type for parameter `n_values`. Expected 'auto', int or array of ints, got <class 'numpy.ndarray'>
Description
There is a difference between the actual default value of the `n_values` parameter in OneHotEncoder and the default assumed by the documentation and some internal code. This leads to errors under specific conditions.
- The documentation here states that the default value is 'auto'.
- The code here for _handle_deprecations assumes that the default value is 'auto'.
- But the actual __init__ method has n_values=None as default.
- If I remove the handle_unknown='ignore' or add n_values='auto' in the code, the code runs successfully, but the following warnings are shown:
/home/vivek/anaconda3/envs/tensorflow/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:368: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
warnings.warn(msg, FutureWarning)
/home/vivek/anaconda3/envs/tensorflow/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:390: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
"use the ColumnTransformer instead.", DeprecationWarning)
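The mismatch described above can be illustrated with a small toy model. This is only a sketch of the general pattern (a `None` default in the signature that is normalized to 'auto' on one code path but not on another); `ToyEncoder` and `_resolve` are hypothetical names, and the branching is a simplification for illustration, not scikit-learn's actual implementation.

```python
# Toy model of the default mismatch; NOT scikit-learn's code.
# ToyEncoder and _resolve are hypothetical names used for illustration only.


class ToyEncoder:
    def __init__(self, n_values=None, categorical_features=None,
                 handle_unknown="error"):
        # The signature default is None, although the docs describe 'auto'.
        self.n_values = n_values
        self.categorical_features = categorical_features
        self.handle_unknown = handle_unknown

    def _resolve(self):
        """Return (use_legacy_path, effective_n_values)."""
        effective = self.n_values
        legacy = False
        if self.handle_unknown != "ignore":
            # This path normalizes the unset default to 'auto'.
            legacy = True
            if effective is None:
                effective = "auto"
        # With handle_unknown='ignore' the new path is assumed,
        # so an unset n_values is left as None.
        if self.categorical_features is not None:
            # Selecting columns always forces the legacy path.
            legacy = True
        return legacy, effective

    def fit(self):
        legacy, n_values = self._resolve()
        if legacy and not (n_values == "auto" or isinstance(n_values, int)):
            raise TypeError(
                "Wrong type for parameter `n_values`: %r" % (n_values,))
        return self


# Column selection alone, or with an explicit n_values='auto', is fine.
ToyEncoder(categorical_features=[4]).fit()
ToyEncoder(categorical_features=[4], handle_unknown="ignore",
           n_values="auto").fit()

# Column selection combined with handle_unknown='ignore' leaves
# n_values as None on the legacy path and fails.
try:
    ToyEncoder(categorical_features=[4], handle_unknown="ignore").fit()
except TypeError as exc:
    print("TypeError:", exc)
```

In this sketch, the failure only appears when both options are combined, which matches the observation that dropping handle_unknown='ignore' or passing n_values='auto' avoids the error.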
Versions
System: python: 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) [GCC 7.2.0] executable: /home/vivek/anaconda3/envs/my_env/bin/python machine: Linux-4.15.0-43-generic-x86_64-with-debian-buster-sid
BLAS: macros: SCIPY_MKL_H=None, HAVE_CBLAS=None lib_dirs: /home/vivek/anaconda3/envs/my_env/lib cblas_libs: mkl_rt, pthread
Python deps: pip: 18.1 setuptools: 40.2.0 sklearn: 0.20.1 numpy: 1.15.4 scipy: 1.1.0 Cython: 0.29 pandas: 0.23.4
Top GitHub Comments
Thanks for the report, can confirm in master. I’m a bit confused because I was pretty sure we have a test for that. Maybe @jorisvandenbossche has time to investigate? Setting n_values='auto' does result in the correct behavior. I guess it’s too late to put this into 0.20.2 😕

This seems to be a bad interaction between the presence of both categorical_features and handle_unknown='ignore'. The reason that, with handle_unknown='ignore', we don’t have to use the legacy mode (unless categorical_features is used, which was the problem) is that in that case there is no difference from the new behaviour (it then drops the features for the numbers in the range [0, max] that are not present in the values). Will do a PR shortly.

Note that this is completely as expected. categorical_features is deprecated, and you can use the ColumnTransformer to replace it.
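
As a rough sketch of the ColumnTransformer replacement mentioned above, the reproduction script could be rewritten along these lines. The transformer name "onehot", the fixed random seed, and the choice of remainder="passthrough" are assumptions made for this example; categories='auto' is passed explicitly because the FutureWarning quoted earlier suggests it.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same data layout as the reproduction script: four numerical columns
# followed by one categorical column (index 4). The seed is an assumption
# added here only to make the example repeatable.
rng = np.random.RandomState(0)
numerical_features = rng.randint(10, size=(5, 4))
categorical = np.array([2, 2, 3, 2, 3]).reshape(-1, 1)
X = np.hstack((numerical_features, categorical))

# Encode only column 4 and pass the remaining columns through unchanged.
column_transformer = ColumnTransformer(
    [("onehot",
      OneHotEncoder(categories="auto", handle_unknown="ignore"),
      [4])],
    remainder="passthrough",
)
X_encoded = column_transformer.fit_transform(X)
print(X_encoded.shape)
```

The result has six columns: two one-hot columns for the categories 2 and 3, followed by the four passed-through numerical columns.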