
TypeError: Wrong type for parameter `n_values` in OneHotEncoder

Steps/Code to Reproduce

import numpy as np
from sklearn.preprocessing import OneHotEncoder

numerical_features = np.random.randint(10, size=(5,4))
categorical = np.array([2, 2, 3, 2, 3]).reshape(-1,1)

X = np.hstack((numerical_features, categorical))

onehotencoder = OneHotEncoder(categorical_features=[4], 
                              handle_unknown='ignore')

X_encoded = onehotencoder.fit_transform(X)

Expected Results

No error should be thrown. OneHotEncoder should fall back to its legacy behaviour and encode only the supplied column.

Actual Results

/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:390: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
  "use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):

  File "<ipython-input-15-c174bb78e628>", line 1, in <module>
    runfile('/home/vivek/untitless.py', wdir='/home/vivek')

  File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/spyder_kernels/customize/spydercustomize.py", line 668, in runfile
    execfile(filename, namespace)

  File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/spyder_kernels/customize/spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/home/vivek/untitless.py", line 24, in <module>
    X_encoded = onehotencoder.fit_transform(X)

  File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 514, in fit_transform
    self._categorical_features, copy=True)

  File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/base.py", line 71, in _transform_selected
    X_sel = transform(X[:, ind[sel]])

  File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 456, in _legacy_fit_transform
    % type(X))

TypeError: Wrong type for parameter `n_values`. Expected 'auto', int or array of ints, got <class 'numpy.ndarray'>

Description

There is a difference between the actual default n_values parameter in OneHotEncoder and the assumption made in documentation and some internal code. This is leading to errors in specific conditions.

The documentation states that the default value is 'auto'.
The code for _handle_deprecations assumes that the default value is 'auto'.
But the actual __init__ method has n_values=None as its default.
If I remove the handle_unknown='ignore' or add n_values='auto' in the code, the code runs successfully, but the following warnings are shown:

/home/vivek/anaconda3/envs/tensorflow/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:368: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
  warnings.warn(msg, FutureWarning)
/home/vivek/anaconda3/envs/tensorflow/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:390: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
  "use the ColumnTransformer instead.", DeprecationWarning)
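
The FutureWarning above contrasts two ways of determining categories for integer data. The following is a minimal pure-Python sketch of that difference, not scikit-learn's implementation; the helper names `legacy_categories`, `unique_categories`, and `one_hot` are hypothetical and exist only for illustration.

```python
def legacy_categories(values):
    # Range-based rule from the warning: every integer in [0, max(values)].
    return list(range(max(values) + 1))


def unique_categories(values):
    # Future rule from the warning: only the values that actually occur.
    return sorted(set(values))


def one_hot(values, categories):
    # Map each value to a row with a single 1 in its category's position.
    index = {category: i for i, category in enumerate(categories)}
    return [
        [1 if index[value] == j else 0 for j in range(len(categories))]
        for value in values
    ]


values = [2, 2, 3, 2, 3]  # the categorical column from the reproducer

legacy = one_hot(values, legacy_categories(values))  # columns for 0, 1, 2, 3
future = one_hot(values, unique_categories(values))  # columns for 2, 3

print(len(legacy[0]), len(future[0]))
```

Under the range-based rule the column gets one slot per integer up to the maximum, even for values that never occur; under the unique-value rule it gets one slot per observed value.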

Versions

System:
    python: 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) [GCC 7.2.0]
    executable: /home/vivek/anaconda3/envs/my_env/bin/python
    machine: Linux-4.15.0-43-generic-x86_64-with-debian-buster-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
    lib_dirs: /home/vivek/anaconda3/envs/my_env/lib
    cblas_libs: mkl_rt, pthread

Python deps:
    pip: 18.1
    setuptools: 40.2.0
    sklearn: 0.20.1
    numpy: 1.15.4
    scipy: 1.1.0
    Cython: 0.29
    pandas: 0.23.4

Issue Analytics

Created 5 years ago
Reactions: 1
Comments: 5 (5 by maintainers)

Top GitHub Comments

amueller commented, Dec 28, 2018

Thanks for the report, can confirm in master. I’m a bit confused because I was pretty sure we have a test for that. Maybe @jorisvandenbossche has time to investigate? Setting n_values='auto' does result in the correct behavior. I guess it’s too late to put this into 0.20.2 😕

jorisvandenbossche commented, Jan 4, 2019

This seems to be a bad interaction between categorical_features and handle_unknown='ignore' when both are present. With handle_unknown='ignore' we don't have to use the legacy mode, because in that case there is no difference with the new behaviour (the legacy mode then drops the features for the numbers in the range [0, max] that are not present in the values). The exception is when categorical_features is used, and that was the problem.

Will do a PR shortly.

Quoting the original report: "If I remove the handle_unknown='ignore' or add n_values='auto' in the code, the code runs successfully, but the following warnings are shown:"

Note that this is completely as expected. categorical_features is deprecated, and you can use the ColumnTransformer to replace it.
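
As a rough sketch of the ColumnTransformer replacement suggested above, the reproducer could be rewritten along these lines. The transformer name "onehot", the fixed random seed, and the choice of sparse_threshold=0 (to always get a dense array back) are assumptions made for illustration and are not part of the original report.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same data layout as the reproducer: four numeric columns plus one
# categorical column at index 4. The seed is an illustrative assumption.
rng = np.random.RandomState(0)
numerical_features = rng.randint(10, size=(5, 4))
categorical = np.array([2, 2, 3, 2, 3]).reshape(-1, 1)
X = np.hstack((numerical_features, categorical))

# One-hot encode only column 4 and pass the other columns through unchanged.
column_transformer = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), [4])],
    remainder="passthrough",
    sparse_threshold=0,  # assumption: always return a dense array
)

X_encoded = column_transformer.fit_transform(X)

# The encoded columns come first, followed by the passthrough columns:
# 2 one-hot columns + 4 numeric columns.
print(X_encoded.shape)
```

Note that, unlike categorical_features, ColumnTransformer places the transformed columns before the remaining ones, so column order in the output differs from the input.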
