TypeError : Wrong type for parameter `n_values` in OneHotEncoder
Steps/Code to Reproduce
import numpy as np
from sklearn.preprocessing import OneHotEncoder
numerical_features = np.random.randint(10, size=(5,4))
categorical = np.array([2, 2, 3, 2, 3]).reshape(-1,1)
X = np.hstack((numerical_features, categorical))
onehotencoder = OneHotEncoder(categorical_features=[4],
                              handle_unknown='ignore')
X_encoded = onehotencoder.fit_transform(X)
Expected Results
No error should be thrown. OneHotEncoder should behave as in legacy mode and encode only the supplied columns.
Actual Results
/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:390: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
"use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
File "<ipython-input-15-c174bb78e628>", line 1, in <module>
runfile('/home/vivek/untitless.py', wdir='/home/vivek')
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/spyder_kernels/customize/spydercustomize.py", line 668, in runfile
execfile(filename, namespace)
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/spyder_kernels/customize/spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/vivek/untitless.py", line 24, in <module>
X_encoded = onehotencoder.fit_transform(X)
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 514, in fit_transform
self._categorical_features, copy=True)
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/base.py", line 71, in _transform_selected
X_sel = transform(X[:, ind[sel]])
File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 456, in _legacy_fit_transform
% type(X))
TypeError: Wrong type for parameter `n_values`. Expected 'auto', int or array of ints, got <class 'numpy.ndarray'>
Description
The actual default of the `n_values` parameter in OneHotEncoder differs from the default assumed by the documentation and by some internal code. This leads to errors under specific conditions.
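The kind of mismatch reported here can be pictured with a toy class. This is only a sketch of the general pattern, not scikit-learn's code: `ToyEncoder` and `resolve_n_values` are hypothetical names invented for illustration, and the error text is copied from the traceback above. The constructor defaults to `None`, while the later check only recognises `'auto'`, an int, or a list of ints.

```python
# Hypothetical, simplified illustration of a default-value mismatch.
# This is NOT scikit-learn's implementation.


class ToyEncoder:
    def __init__(self, n_values=None):
        # The constructor's real default is None ...
        self.n_values = n_values

    def resolve_n_values(self):
        # ... but this check is written as if the default were 'auto',
        # so None falls through to the error branch.
        if isinstance(self.n_values, str) and self.n_values == "auto":
            return "auto"
        if isinstance(self.n_values, int):
            return self.n_values
        if isinstance(self.n_values, (list, tuple)) and all(
            isinstance(v, int) for v in self.n_values
        ):
            return list(self.n_values)
        raise TypeError(
            "Wrong type for parameter `n_values`. Expected 'auto', int or "
            "array of ints, got %r" % type(self.n_values)
        )


# Passing 'auto' explicitly works; relying on the default does not.
explicit = ToyEncoder(n_values="auto").resolve_n_values()

try:
    ToyEncoder().resolve_n_values()
    default_failed = False
except TypeError:
    default_failed = True
```

In this toy version, passing `n_values='auto'` explicitly avoids the error branch, which parallels the workaround described in this report.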
- The documentation here states that the default value is `'auto'`.
- The code here for `_handle_deprecations` assumes that the default value is `'auto'`.
- But the actual `__init__` method has `n_values=None` as default.

If I remove the `handle_unknown='ignore'` or add `n_values='auto'` in the code, the code runs successfully, but the following warnings are shown:
/home/vivek/anaconda3/envs/tensorflow/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:368: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
warnings.warn(msg, FutureWarning)
/home/vivek/anaconda3/envs/tensorflow/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:390: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
"use the ColumnTransformer instead.", DeprecationWarning)
Versions
System: python: 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) [GCC 7.2.0] executable: /home/vivek/anaconda3/envs/my_env/bin/python machine: Linux-4.15.0-43-generic-x86_64-with-debian-buster-sid
BLAS: macros: SCIPY_MKL_H=None, HAVE_CBLAS=None lib_dirs: /home/vivek/anaconda3/envs/my_env/lib cblas_libs: mkl_rt, pthread
Python deps: pip: 18.1 setuptools: 40.2.0 sklearn: 0.20.1 numpy: 1.15.4 scipy: 1.1.0 Cython: 0.29 pandas: 0.23.4
Thanks for the report, can confirm in master. I’m a bit confused because I was pretty sure we have a test for that. Maybe @jorisvandenbossche has time to investigate? Setting `n_values='auto'` does result in the correct behavior. I guess it’s too late to put this into 0.20.2 😕

This seems to be a bad interaction between the presence of both `categorical_features` and `handle_unknown='ignore'`. (the reason that with `handle_unknown='ignore'`, we don’t have to use the legacy mode (unless `categorical_features` is used, that was the problem) is because in that case there is no difference with the new behaviour (then it is dropping the features for the numbers in the range [0, max] that are not present in the values))

Will do a PR shortly.

Note that this is completely as expected. `categorical_features` is deprecated, and you can use the ColumnTransformer to replace it.
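
A minimal sketch of the ColumnTransformer replacement suggested above, applied to the data from the reproduction script. It assumes a scikit-learn version that ships `sklearn.compose.ColumnTransformer` (0.20 or later). The fixed random seed, the transformer name `"onehot"`, and `sparse_threshold=0` (to force a dense result) are additions for illustration and are not part of the original report.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same data layout as in the report; the seed is an assumption added
# so that the example is reproducible.
rng = np.random.RandomState(0)
numerical_features = rng.randint(10, size=(5, 4))
categorical = np.array([2, 2, 3, 2, 3]).reshape(-1, 1)
X = np.hstack((numerical_features, categorical))

# One-hot encode only column 4 and pass the other columns through unchanged.
column_transformer = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), [4])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = column_transformer.fit_transform(X)
print(X_encoded.shape)
```

In the result, the one-hot columns for the categorical feature come first and the passed-through numerical columns follow.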