TypeError : Wrong type for parameter `n_values` in OneHotEncoder

Steps/Code to Reproduce

import numpy as np
from sklearn.preprocessing import OneHotEncoder

numerical_features = np.random.randint(10, size=(5,4))
categorical = np.array([2, 2, 3, 2, 3]).reshape(-1,1)

X = np.hstack((numerical_features, categorical))

onehotencoder = OneHotEncoder(categorical_features=[4], 
                              handle_unknown='ignore')

X_encoded = onehotencoder.fit_transform(X)

Expected Results

No error should be thrown. OneHotEncoder should work in legacy mode and encode only the specified columns.

Actual Results

/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:390: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
  "use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):

  File "<ipython-input-15-c174bb78e628>", line 1, in <module>
    runfile('/home/vivek/untitless.py', wdir='/home/vivek')

  File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/spyder_kernels/customize/spydercustomize.py", line 668, in runfile
    execfile(filename, namespace)

  File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/spyder_kernels/customize/spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/home/vivek/untitless.py", line 24, in <module>
    X_encoded = onehotencoder.fit_transform(X)

  File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 514, in fit_transform
    self._categorical_features, copy=True)

  File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/base.py", line 71, in _transform_selected
    X_sel = transform(X[:, ind[sel]])

  File "/home/vivek/anaconda3/envs/my_env/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 456, in _legacy_fit_transform
    % type(X))

TypeError: Wrong type for parameter `n_values`. Expected 'auto', int or array of ints, got <class 'numpy.ndarray'>

Description

There is a mismatch between the actual default of the n_values parameter in OneHotEncoder and the default assumed by the documentation and some internal code. This leads to errors under specific conditions.

  • The documentation here states that the default value is 'auto'.

  • The code here for _handle_deprecations assumes that the default value is 'auto'.

  • But the actual __init__ method has n_values=None as default.

  • If I remove the handle_unknown='ignore' or add n_values='auto' in the code, the code runs successfully, but the following warnings are shown:

/home/vivek/anaconda3/envs/tensorflow/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:368: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
  warnings.warn(msg, FutureWarning)
/home/vivek/anaconda3/envs/tensorflow/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:390: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
  "use the ColumnTransformer instead.", DeprecationWarning)
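
One way to check the default described in the list above is to read it from the constructor signature. This is only a small sketch: describe_default is a hypothetical helper name, not part of scikit-learn, and n_values exists only in releases that still carry the legacy parameter (such as the 0.20.1 reported here), so the helper returns None when the parameter is absent.

```python
import inspect

from sklearn.preprocessing import OneHotEncoder


def describe_default(estimator_cls, name):
    """Return repr() of a constructor default, or None if the parameter is absent.

    Hypothetical helper for illustration; not a scikit-learn API.
    """
    params = inspect.signature(estimator_cls.__init__).parameters
    param = params.get(name)
    if param is None:
        return None
    return repr(param.default)


# On a release that still has n_values, this shows its real default.
# On releases without the parameter, it prints None.
print("n_values default:", describe_default(OneHotEncoder, "n_values"))
print("handle_unknown default:", describe_default(OneHotEncoder, "handle_unknown"))
```

Comparing the printed value with the documented default shows whether the two disagree in the installed version.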

Versions

System:
    python: 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) [GCC 7.2.0]
    executable: /home/vivek/anaconda3/envs/my_env/bin/python
    machine: Linux-4.15.0-43-generic-x86_64-with-debian-buster-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
    lib_dirs: /home/vivek/anaconda3/envs/my_env/lib
    cblas_libs: mkl_rt, pthread

Python deps:
    pip: 18.1
    setuptools: 40.2.0
    sklearn: 0.20.1
    numpy: 1.15.4
    scipy: 1.1.0
    Cython: 0.29
    pandas: 0.23.4

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

amueller commented, Dec 28, 2018 (1 reaction)

Thanks for the report, can confirm in master. I’m a bit confused because I was pretty sure we have a test for that. Maybe @jorisvandenbossche has time to investigate? Setting n_values='auto' does result in the correct behavior. I guess it’s too late to put this into 0.20.2 😕

jorisvandenbossche commented, Jan 4, 2019 (0 reactions)

This seems to be a bad interaction between categorical_features and handle_unknown='ignore' when both are present. With handle_unknown='ignore' we don't have to use the legacy mode, because in that case there is no difference from the new behaviour (it then drops the features for the numbers in the range [0, max] that are not present in the values). The exception is when categorical_features is used, and that was the problem.

Will do a PR shortly.

If I remove the handle_unknown='ignore' or add n_values='auto' in the code, the code runs successfully, but the following warnings are shown:

Note that this is completely as expected. categorical_features is deprecated, and you can use the ColumnTransformer to replace it.
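
For reference, here is a rough sketch of what that replacement could look like for the reproduction script above. It assumes scikit-learn 0.20 or later, where ColumnTransformer is available. The transformer name "onehot", the seeded RandomState, and the choice of sparse_threshold=0 (to force a dense result) are illustrative assumptions, not something taken from the issue.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same data layout as the reproduction: four numerical columns, one categorical.
rng = np.random.RandomState(0)  # seeded only to keep the sketch repeatable
numerical_features = rng.randint(10, size=(5, 4))
categorical = np.array([2, 2, 3, 2, 3]).reshape(-1, 1)
X = np.hstack((numerical_features, categorical))

# One-hot encode only column 4 and pass the remaining columns through unchanged.
column_transformer = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(categories="auto", handle_unknown="ignore"), [4]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

X_encoded = column_transformer.fit_transform(X)
print(X_encoded.shape)
```

Note that the encoded columns come first in the output, followed by the passed-through columns, so the column order differs from what categorical_features produced in legacy mode.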
