ENH: Allow selection of reference values in get_dummies


When one-hot encoding a pandas categorical column with drop_first=True, there is no control over which value is dropped. So if I need to specify the reference value to drop, I can’t use drop_first; I have to manually drop the columns that were created unnecessarily.

I would like to enhance the get_dummies method to be able to specify for each column in ‘columns’ a reference value to be used as the dropped column. For example:


import numpy as np
import pandas as pd

df = pd.DataFrame(data={'ageband': np.random.choice(range(20, 80, 10), 100),
                        'country': np.random.choice(['uk', 'nl', 'be', 'fr'], 100)}).astype('category')
ref_values = {'ageband': 40, 'country': 'be'}

# ref_values is the proposed parameter; it does not exist in pandas today
df_encoded = pd.get_dummies(df, columns=['ageband', 'country'], ref_values=ref_values)

For each categorical column specified in the new ref_values parameter: if the value exists, use it as the reference value; if the value does not exist, proceed with the normal behaviour, i.e. drop the (lexically) first value (or ignore with a warning?).
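The behaviour described above could be sketched roughly as follows. This is only an illustration built on top of the existing pd.get_dummies, not the proposed pandas implementation: get_dummies_with_ref is a hypothetical helper name, and the fallback (warn, then drop the first level) is just one of the two options raised above.

```python
import warnings

import pandas as pd


def get_dummies_with_ref(df, columns, ref_values, prefix_sep="_"):
    """Hypothetical helper: one-hot encode `columns`, dropping one reference level each."""
    encoded = pd.get_dummies(df, columns=columns, prefix_sep=prefix_sep)
    to_drop = []
    for col in columns:
        # Levels in the order get_dummies uses when building the dummy columns.
        levels = list(df[col].astype("category").cat.categories)
        ref = ref_values.get(col)
        if ref not in levels:
            # Assumed fallback: behave like drop_first=True and drop the first level.
            if ref is not None:
                warnings.warn(f"{ref!r} not found in {col!r}; dropping first level instead")
            ref = levels[0]
        to_drop.append(f"{col}{prefix_sep}{ref}")
    return encoded.drop(columns=to_drop)
```

Called as get_dummies_with_ref(df, ['ageband', 'country'], {'ageband': 40, 'country': 'be'}), it should leave every dummy column except ageband_40 and country_be.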

I don’t think there are any API breaking implications?

To achieve this now, without the enhancement, I have to do something like:

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'ageband': np.random.choice(range(20, 80, 10), 100),
                        'country': np.random.choice(['uk', 'nl', 'be', 'fr'], 100)}).astype('category')

prefix_sep = ':'  # say
df = pd.get_dummies(df, columns=['ageband', 'country'], prefix_sep=prefix_sep)
ref_values = {'ageband': 40, 'country': 'be'}

# dummy columns are named <column><prefix_sep><value>, e.g. 'ageband:40'
columns_to_drop = [col + prefix_sep + str(val) for col, val in ref_values.items()]
# errors='ignore' silently skips a reference value that is not present
df.drop(columns=columns_to_drop, inplace=True, errors='ignore')
# additional code would be required to handle errors such as reference value not present
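Another possible workaround, sketched here as an alternative to dropping columns after the fact, is to make the reference value the first category and then let drop_first=True remove it. It assumes the columns already have category dtype, as in the example above; encode_with_reference is a hypothetical helper name, and a reference value that is not present simply leaves the existing category order untouched.

```python
import pandas as pd


def encode_with_reference(df, ref_values):
    """Hypothetical helper: put each reference value first, then rely on drop_first."""
    out = df.copy()
    for col, ref in ref_values.items():
        cats = list(out[col].cat.categories)
        if ref in cats:
            # Move the reference level to the front so drop_first=True removes it.
            cats.remove(ref)
            out[col] = out[col].cat.reorder_categories([ref] + cats)
    return pd.get_dummies(out, columns=list(ref_values), drop_first=True)
```

This avoids building the column names by hand, at the cost of changing the category order on a copy of the data.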

I am willing to have a go at this if it is accepted as an enhancement.

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
telferm57 commented, May 4, 2020

I have started coding this and should have it complete soon.

0 reactions
telferm57 commented, May 1, 2020

I suggest further:

If ref_vals is not a list, or the length of ref_vals differs from the length of ‘columns’, an exception will be raised.

If drop_first=True and ref_vals is supplied, raise an exception (could just warn?).

If the object is a Series, allow ref_vals = ‘string’ as well as ref_vals = [‘string’].
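A minimal sketch of these checks, assuming the stricter option (raise rather than warn) and using the hypothetical name normalize_ref_vals; none of this is existing pandas code:

```python
import pandas as pd


def normalize_ref_vals(data, columns, ref_vals, drop_first=False):
    """Hypothetical validation of the suggested ref_vals argument; returns a list."""
    if drop_first:
        # Assumed choice: raise instead of warning when both options are given.
        raise ValueError("ref_vals cannot be combined with drop_first=True")
    is_series = isinstance(data, pd.Series)
    if is_series and isinstance(ref_vals, str):
        # A Series has a single column, so a bare string is accepted.
        return [ref_vals]
    if not isinstance(ref_vals, list):
        raise TypeError("ref_vals must be a list")
    expected = 1 if is_series else len(columns)
    if len(ref_vals) != expected:
        raise ValueError(f"expected {expected} reference values, got {len(ref_vals)}")
    return ref_vals
```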
