ENH: Allow selection of reference values in get_dummies
When one-hot encoding a pandas categorical column with drop_first=True, there is no control over which value is dropped. So if I need to specify the reference value to drop, I can't use drop_first. I have to create every dummy column and then manually drop the ones I did not need.
I would like to enhance the get_dummies method to be able to specify for each column in ‘columns’ a reference value to be used as the dropped column. For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'ageband': np.random.choice(range(20, 80, 10), 100),
                        'country': np.random.choice(['uk', 'nl', 'be', 'fr'], 100)}).astype('category')
# ageband categories are integers, so the reference value is 40 rather than '40'
ref_values = {'ageband': 40, 'country': 'be'}
df_encoded = pd.get_dummies(df, columns=['ageband', 'country'], ref_values=ref_values)
For each categorical column specified in the new ref_values parameter:
- if the value exists, use it as the reference value
- if the value does not exist, proceed with the normal behaviour, i.e. drop the (lexically) first value (or ignore with a warning?)
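The rule above can be sketched with the current pandas API. This is only an illustration of the proposed behaviour, not pandas code: `get_dummies_with_ref` is a hypothetical helper name, it assumes the default naming of dummy columns (`<column><prefix_sep><value>`), and it picks the "warn and fall back to the first level" option when a supplied value is missing.

```python
import warnings

import pandas as pd


def get_dummies_with_ref(df, columns, ref_values, prefix_sep="_"):
    """Hypothetical helper: one-hot encode `columns`, dropping one chosen level per column."""
    encoded = pd.get_dummies(df, columns=columns, prefix_sep=prefix_sep)
    to_drop = []
    for col in columns:
        # Levels in the same order that get_dummies uses for its output columns.
        levels = list(df[col].astype("category").cat.categories)
        ref = ref_values.get(col)
        if ref not in levels:
            if col in ref_values:
                warnings.warn(
                    f"{ref!r} is not a level of {col!r}; dropping {levels[0]!r} instead"
                )
            ref = levels[0]  # the level that drop_first=True would drop
        to_drop.append(f"{col}{prefix_sep}{ref}")
    return encoded.drop(columns=to_drop)


df = pd.DataFrame(
    {"ageband": [20, 30, 40, 40], "country": ["uk", "be", "nl", "be"]}
).astype("category")
encoded = get_dummies_with_ref(df, ["ageband", "country"], {"ageband": 40, "country": "be"})
```

Columns that are not mentioned in `ref_values` silently fall back to the first level, which matches what drop_first=True does today.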
I don’t think there are any API breaking implications?
To achieve this now, without the enhancement, I have to do something like:
df = pd.DataFrame(data={'ageband': np.random.choice(range(20, 80, 10), 100),
                        'country': np.random.choice(['uk', 'nl', 'be', 'fr'], 100)}).astype('category')
prefix_sep = ':'  # say
df = pd.get_dummies(df, columns=['ageband', 'country'], prefix_sep=prefix_sep)
ref_values = {'ageband': 40, 'country': 'be'}
# dummy columns are named <column><prefix_sep><value>
columns_to_drop = [col + prefix_sep + str(val) for col, val in ref_values.items()]
df.drop(columns=columns_to_drop, inplace=True, errors='ignore')
# additional code would be required to handle errors such as reference value not present
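One possible shape for that additional error handling is sketched below. `drop_reference_columns` is a hypothetical name, and raising a KeyError is just one choice (the issue leaves open whether a missing value should raise, warn, or fall back). It checks for the expected dummy columns up front instead of relying on errors='ignore', which would hide a mistyped reference value.

```python
import pandas as pd


def drop_reference_columns(encoded, ref_values, prefix_sep="_"):
    """Hypothetical helper: drop one dummy column per entry in `ref_values`, failing loudly."""
    wanted = {col: f"{col}{prefix_sep}{val}" for col, val in ref_values.items()}
    missing = sorted(name for name in wanted.values() if name not in encoded.columns)
    if missing:
        raise KeyError(f"reference columns not found: {missing}")
    return encoded.drop(columns=list(wanted.values()))


df = pd.DataFrame(
    {"ageband": [20, 30, 40, 50], "country": ["uk", "nl", "be", "fr"]}
).astype("category")
encoded = pd.get_dummies(df, columns=["ageband", "country"], prefix_sep=":")
reduced = drop_reference_columns(encoded, {"ageband": 40, "country": "be"}, prefix_sep=":")
```

Returning a new frame rather than using inplace=True keeps the fully encoded frame available if the drop fails.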
I am willing to have a go at this if it is accepted as an enhancement
Issue Analytics
- Created 3 years ago
- Comments:5 (2 by maintainers)
Top GitHub Comments
I have started coding this, should have it complete soon
I suggest further:
- if ref_vals is not a list, or the length of ref_vals != the length of 'columns', an exception will be raised
- if drop_first=True and ref_vals is supplied, raise an exception (could just warn?)
- if the object is a Series, allow ref_vals='string' as well as ref_vals=['string']
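A rough sketch of how those three checks might look is given below. Nothing here exists in pandas: `normalize_ref_vals` is a hypothetical name, the exception types are my own choice, it raises (rather than warns) when drop_first=True is combined with ref_vals, and for a Series it accepts any bare non-list value, which is slightly broader than the string-only wording above.

```python
import pandas as pd


def normalize_ref_vals(data, columns, ref_vals, drop_first=False):
    """Hypothetical validation for the proposed `ref_vals` argument; returns a list or None."""
    if ref_vals is None:
        return None
    if drop_first:
        raise ValueError("pass either drop_first=True or ref_vals, not both")
    if isinstance(data, pd.Series):
        # A Series has a single column, so a bare value is accepted as shorthand.
        if not isinstance(ref_vals, list):
            ref_vals = [ref_vals]
        expected = 1
    else:
        if not isinstance(ref_vals, list):
            raise TypeError("ref_vals must be a list")
        expected = len(columns)
    if len(ref_vals) != expected:
        raise ValueError(f"expected {expected} reference value(s), got {len(ref_vals)}")
    return ref_vals


countries = pd.Series(["uk", "nl", "be"], dtype="category")
as_list = normalize_ref_vals(countries, None, "be")
```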