OneHotEncoding issue

When I one-hot encode the data, the behaviour is as expected:

one_hot_encoder = vaex.ml.OneHotEncoder(features=["scp"])
training_data = one_hot_encoder.fit_transform(data)

This also works as expected: training_data.get_column_names()

I get 'scp_0.0', 'scp_0.1', 'scp_0.3', 'scp_0.4', 'scp_0.5', 'scp_0.8', 'scp_0.9', 'scp_1.0', 'scp_1.1', 'scp_1.3', 'scp_1.8', 'scp_1.9',

But when I try training_data[['scp_0.0', 'scp_0.1']] or training_data[training_data.get_column_names()], I get an error message:

  File "C:\Program Files\Anaconda3\lib\ast.py", line 35, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    scp_0.0
    ^
SyntaxError: invalid syntax

But training_data['scp_0.0'] shows the right value.

One workaround was training_data[training_data.column_names], but then I am unable to fit the data: training fails with the same error message as above. The columns have no missing values. Am I missing something?
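
The traceback above ends inside ast.parse, which suggests the column name is being handed to Python's parser as an expression rather than treated as a plain label. The following is a minimal, vaex-free sketch of why a name such as scp_0.0 fails there. The helper name parses_as_single_name is hypothetical and is not part of vaex; the sketch only uses the standard library.

```python
import ast


def parses_as_single_name(column: str) -> bool:
    """Return True if `column` parses as one bare Python name (hypothetical helper)."""
    try:
        # mode="eval" parses the text as a single expression.
        tree = ast.parse(column, mode="eval")
    except SyntaxError:
        # "scp_0.0" tokenizes as the name "scp_0" followed by the number ".0",
        # which is not a valid expression.
        return False
    return isinstance(tree.body, ast.Name)


columns = ["scp", "scp_0", "scp_0.0", "scp_1.9"]
results = {name: parses_as_single_name(name) for name in columns}
print(results)
# {'scp': True, 'scp_0': True, 'scp_0.0': False, 'scp_1.9': False}
```

Any generated column whose name contains a dot would fail this check, which is consistent with the error appearing only when the names are passed in a form that gets parsed.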

Top GitHub Comments

maartenbreddels commented, Jun 4, 2020

This is released now. You can try it out with: $ pip install "vaex-core>=2.0.2"

JovanVeljanoski commented, Jun 3, 2020

Hi @arjunrao01

Thanks for the report. This is a rather complex issue related to how the Expression system works. We hope to make a better solution for this soon.

In the meantime, you can try using training_data[training_data.get_column_names(alias=False)]

That will give you the expression names that vaex understands and everything should work from there.
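
To illustrate the general idea of keeping a display label separate from a parser-safe name, here is a small standard-library sketch. It is not vaex's implementation, and the names vaex actually generates may differ; the helper to_identifier and the aliases mapping are hypothetical and exist only for this example.

```python
import re


def to_identifier(label: str) -> str:
    """Hypothetical helper: turn a column label into a valid Python identifier."""
    # Replace every character that is not a letter, digit or underscore.
    cleaned = re.sub(r"\W", "_", label)
    # Identifiers cannot be empty or start with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


labels = ["scp_0.0", "scp_0.1", "scp_1.9"]
# Map each original label to a name that a Python parser accepts.
aliases = {label: to_identifier(label) for label in labels}
print(aliases)
# {'scp_0.0': 'scp_0_0', 'scp_0.1': 'scp_0_1', 'scp_1.9': 'scp_1_9'}
```

This sketch does not handle collisions (for example, labels scp_0.0 and scp_0_0 would map to the same name) or Python keywords, so a real mapping would need extra checks.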
