One hot encoding without categorical data
Describe the bug
My AutoML workflow reports that it is performing one-hot encoding on my dataset, even though we make sure no dataset contains categorical features before passing it to the AutoML pipeline; we binarise / one-hot encode the data ourselves beforehand. How is this possible? Do you have any hints? Perhaps our reading of the AutoML log is incorrect?
To Reproduce
Steps to reproduce the behaviour:
- The dataset contains personal, health-related information, so I cannot share it directly or say much more about it. If you need it urgently, I will have to take the time to fully anonymise it and re-run the AutoML process to confirm the error persists before releasing the dataset here.
Expected behavior
I would expect the AutoML workflow not to apply one-hot encoding to my data, since every feature is already binary.
Actual behavior, stacktrace or logfile
What prompted us to seek assistance is the following log line:
`'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding'`
Is it correct to conclude that the workflow applied one-hot encoding to the data? If so, we have a problem.
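For reference, here is a minimal sketch of how such a line can surface; the dataset and time budget are illustrative placeholders, not the affected data:

```python
# Illustrative only: even on purely numerical input, the sampled pipeline
# configurations that auto-sklearn reports still contain keys for the
# categorical sub-pipeline, such as the 'one_hot_encoding' choice above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from autosklearn.classification import AutoSklearnClassifier

X, y = load_breast_cancer(return_X_y=True)  # all features are numeric
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = AutoSklearnClassifier(time_left_for_this_task=60)  # short demo budget
automl.fit(X_train, y_train)

# The model description includes the sampled hyperparameters, among them a
# 'categorical_encoding:__choice__' entry like the one quoted above.
print(automl.show_models())
```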
Environment and installation:
Please give details about your installation:
- OS: Ubuntu 18 (stable)
- Is your installation in a virtual environment or conda environment?: No.
- Python version: 3.8.12
- Auto-sklearn version: 0.14.6
Hi @simonprovost,
Sorry, yes, I did look into it but didn’t reply, so that’s my bad. Essentially, as far as an end user is concerned, no one-hot encoding happens, since there are no categorical columns for it to encode. Your data is not affected by any categorical transformers that appear in the configuration; they are never applied.
The more detailed answer is that the search space is not reduced based on the types of columns present. As seen here, we basically define which kind of pipeline to apply to each type of column. These pipelines are then optimized over any hyperparameters they may have. To find the hyperparameters for data preprocessing, we query the components to see if they have any.
To give some more complete information, here’s the `NumericalPreprocessingPipeline` that handles numerical preprocessing (i.e. filling NaNs). You can scroll down to its `_get_pipeline_steps` to see the steps involved, and look further into them to see what hyperparameters they may have (for example, imputation, which has only one hyperparameter).

Now, if you’re interested in how the categorical preprocessing pipeline then affects the search space, it’s much the same process. Here are the steps for the `CategoricalPreprocessingPipeline`. Out of those steps, only `"category_coalescence"` and `"categorical_encoding"` have hyperparameters. For `OHEChoice`, I traced it down to having three hyperparameters, which are essentially the three components here: `[encoding, no_encoding, one_hot_encoding]`. Following the same pattern, I think `"category_coalescence"` has two hyperparameters: `[minority_coalescence, no_coalescence]`.
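To make that concrete, here is a conceptual sketch using ConfigSpace (the library auto-sklearn builds its search spaces with); this is not auto-sklearn’s actual construction code, and the hyperparameter names are simplified:

```python
# Conceptual sketch, NOT auto-sklearn's real code: each "choice" step adds one
# categorical hyperparameter whose values are its available components.
from ConfigSpace import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter

cs = ConfigurationSpace()
cs.add_hyperparameter(CategoricalHyperparameter(
    "categorical_encoding:__choice__",
    ["encoding", "no_encoding", "one_hot_encoding"],
))
cs.add_hyperparameter(CategoricalHyperparameter(
    "category_coalescence:__choice__",
    ["minority_coalescence", "no_coalescence"],
))

# The optimizer samples from this space, so 'one_hot_encoding' can appear in a
# configuration even when there is no categorical column to apply it to.
print(cs.sample_configuration())
```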
Two points. First, `OHEChoice` is a bad name for the class, as it can technically do ordinal encoding as well. Second, the optimization overhead is relatively small: we use SMAC, which is intelligent enough to pick up (given enough time) that these hyperparameters have little/no effect. However, it is still some overhead for the optimizer to learn that.

Sorry for the big info dump; it also serves as a future reference for when we have time to go back and fix it 😃 Hope it was informative.
I’ll keep this open and labelled as a bug, since it is one and has some potential performance implications.
Best, Eddie
Yes, if you have no categorical data in your input then no categorical pre-processing will be applied. Even if it says it chose a categorical pre-processor, that choice means nothing as it can’t apply it to anything.
The way to ensure your data is interpreted correctly:
- If you pass `np.ndarray` data, you have to manually specify categoricals with the `feat_type` parameter; otherwise we try to use the `dtype` of the array, which is almost certainly numeric.
- If you pass a `pd.DataFrame`, you can check with `df.dtypes`. We will treat "object", "string", "category" and "categorical" as categorical data.
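A minimal sketch of both cases (the column names and values are made up for illustration, and the `fit` call is left commented since it needs a real target):

```python
# Illustrative sketch: how column types are communicated to auto-sklearn.
import numpy as np
import pandas as pd

# With a DataFrame, the dtypes carry the information:
df = pd.DataFrame({
    "age": [63.0, 41.0, 58.0],  # float64 -> treated as numerical
    "smoker": pd.Series(["yes", "no", "yes"], dtype="category"),  # categorical
})
print(df.dtypes)

# With a plain ndarray, every column looks numeric, so you must say so yourself
# via feat_type (one entry per column, "Numerical" or "Categorical"):
X = np.array([[63.0, 1.0], [41.0, 0.0], [58.0, 1.0]])
feat_type = ["Numerical", "Categorical"]
# automl.fit(X, y, feat_type=feat_type)  # hypothetical call; needs a real y
```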
I will note for any other readers in the future: we have some preliminary string processing coming, so the note about "string" and "object" will change; however, that’s not relevant for this discussion.
Glad you found it helpful 😃
Best, Eddie