One hot encoding without categorical data
Describe the bug
My AutoML workflow reports that it is performing one-hot encoding on my dataset, even though we make sure no dataset contains categorical features before passing it to the AutoML pipeline; we binarise / one-hot encode the data ourselves beforehand. How is this possible? Do you have any hints? Perhaps our reading of the AutoML log is incorrect?
To Reproduce
Steps to reproduce the behaviour:
- The dataset contains personal, health-related information, so I cannot share it directly or say much more about it. If you need it urgently, I will have to take the time to fully anonymise it and re-run the AutoML process to confirm the error persists before releasing the dataset here.
Expected behavior
I would expect the AutoML workflow not to apply one-hot encoding to my data, since every feature is already binary.
Actual behavior, stacktrace or logfile
What prompted us to seek assistance is the following log line:
`'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding'`
Is it correct to conclude that the workflow applied one-hot encoding to the data? If so, we have a problem.
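For reference, here is a minimal sketch of how such a line can surface; the dataset and time budget are illustrative placeholders, not the affected data:

```python
# Illustrative only: even on purely numerical input, the sampled pipeline
# configurations that auto-sklearn reports still contain keys for the
# categorical sub-pipeline, such as the 'one_hot_encoding' choice above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from autosklearn.classification import AutoSklearnClassifier

X, y = load_breast_cancer(return_X_y=True)  # all features are numeric
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = AutoSklearnClassifier(time_left_for_this_task=60)  # short demo budget
automl.fit(X_train, y_train)

# The model description includes the sampled hyperparameters, among them a
# 'categorical_encoding:__choice__' entry like the one quoted above.
print(automl.show_models())
```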
Environment and installation:
Please give details about your installation:
- OS: Ubuntu 18 (stable)
- Is your installation in a virtual environment or conda environment?: No.
- Python version: 3.8.12
- Auto-sklearn version: 0.14.6
Hi @simonprovost,
Sorry, yes, I did look into it but didn’t reply, so that’s my bad. Essentially, as far as an end user is concerned, no one-hot encoding happens, since there are no categorical columns for it to encode. Your data is not affected by any categorical transformers that appear in the configuration; they are never applied.
The more detailed answer is that the search space is not reduced based on the types of columns present. As seen here, we basically define which kind of pipeline to apply to each type of column. These pipelines are then optimized over any hyperparameters they may have. To find the hyperparameters for data preprocessing, we query the components to see if they have any.
To give some more complete information, here’s the `NumericalPreprocessingPipeline` that handles numerical preprocessing (i.e. filling NaNs). You can scroll down to its `_get_pipeline_steps` to see the steps involved, and look further into them to see what hyperparameters they may have (for example, imputation, which has only one hyperparameter).

Now, if you’re interested in how the categorical preprocessing pipeline then affects the search space, it’s much the same process. Here are the steps for the `CategoricalPreprocessingPipeline`. Out of those steps, only `"category_coalescence"` and `"categorical_encoding"` have hyperparameters. For `OHEChoice`, I traced it down to having three hyperparameters, which are essentially the three components here: `[encoding, no_encoding, one_hot_encoding]`. Following the same pattern, I think `"category_coalescence"` has two hyperparameters: `[minority_coalescence, no_coalescence]`.
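To make that concrete, here is a conceptual sketch using ConfigSpace (the library auto-sklearn builds its search spaces with); this is not auto-sklearn’s actual construction code, and the hyperparameter names are simplified:

```python
# Conceptual sketch, NOT auto-sklearn's real code: each "choice" step adds one
# categorical hyperparameter whose values are its available components.
from ConfigSpace import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter

cs = ConfigurationSpace()
cs.add_hyperparameter(CategoricalHyperparameter(
    "categorical_encoding:__choice__",
    ["encoding", "no_encoding", "one_hot_encoding"],
))
cs.add_hyperparameter(CategoricalHyperparameter(
    "category_coalescence:__choice__",
    ["minority_coalescence", "no_coalescence"],
))

# The optimizer samples from this space, so 'one_hot_encoding' can appear in a
# configuration even when there is no categorical column to apply it to.
print(cs.sample_configuration())
```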
Two points. First, `OHEChoice` is a bad name for the class, as it can technically do ordinal encoding as well. Second, the optimization overhead is relatively small: we use SMAC, which is intelligent enough to pick up (given enough time) that these hyperparameters have little/no effect. However, it is still some overhead for the optimizer to learn that.

Sorry for the big info dump; it also serves as a future reference for when we have time to go back and fix it 😃 Hope it was informative.
I’ll keep this open and labelled as a bug, since it is one and has some potential performance implications.
Best, Eddie
Yes, if you have no categorical data in your input then no categorical pre-processing will be applied. Even if it says it chose a categorical pre-processor, that choice means nothing as it can’t apply it to anything.
The way to ensure your data is interpreted correctly:
- If you pass `np.ndarray` data, you have to manually specify categoricals with the `feat_type` parameter; otherwise we try to use the `dtype` of the array, which is almost certainly numeric.
- If you pass a `pd.DataFrame`, you can check with `df.dtypes`. We will treat "object", "string", "category" and "categorical" as categorical data.
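A minimal sketch of both cases (the column names and values are made up for illustration, and the `fit` call is left commented since it needs a real target):

```python
# Illustrative sketch: how column types are communicated to auto-sklearn.
import numpy as np
import pandas as pd

# With a DataFrame, the dtypes carry the information:
df = pd.DataFrame({
    "age": [63.0, 41.0, 58.0],  # float64 -> treated as numerical
    "smoker": pd.Series(["yes", "no", "yes"], dtype="category"),  # categorical
})
print(df.dtypes)

# With a plain ndarray, every column looks numeric, so you must say so yourself
# via feat_type (one entry per column, "Numerical" or "Categorical"):
X = np.array([[63.0, 1.0], [41.0, 0.0], [58.0, 1.0]])
feat_type = ["Numerical", "Categorical"]
# automl.fit(X, y, feat_type=feat_type)  # hypothetical call; needs a real y
```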
I will note for any other readers in the future: we have some preliminary string processing coming, so the note about "string" and "object" will change; however, that’s not relevant for this discussion.
Glad you found it helpful 😃
Best, Eddie