Feature: Qcut when passed labels and duplicates='drop' should drop corresponding labels
Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
def add_quantiles(data, column, quantiles=4):
    """
    Returns the given dataframe with dummy columns for quantiles of a given column. Quantiles can be a int to
    specify equal spaced quantiles or an array of quantiles
    :param data: DataFrame :type data: DataFrame
    :param column: column to which add quantiles :type column: string
    :param quantiles: number of quantiles to generate or list of quantiles :type quantiles: Union[int, list of float]
    :return: DataFrame
    """
    if isinstance(quantiles, int):
        labels = [column + "_" + str(int(quantile / quantiles * 100)) + "q" for quantile in range(1, quantiles + 1)]
    if isinstance(quantiles, list):
        labels = [column + "_" + str(int(quantile * 100)) + "q" for quantile in quantiles]
        del labels[0]  # Bin labels must be one fewer than the number of bin edges
    data = pd.concat([data, pd.get_dummies(pd.qcut(x=data[column],
                                                   q=quantiles,
                                                   labels=labels, duplicates='drop'))], axis=1)
    return data
zs = np.zeros(3)
rs = np.random.randint(1, 100, size=3)
arr=np.concatenate((zs, rs))
ser = pd.Series(arr)
df = pd.DataFrame({'numbers':ser})
print(df)
#numbers
#0 0.0
#1 0.0
#2 0.0
#3 33.0
#4 81.0
#5 13.0
print(add_quantiles(df, 'numbers'))
Traceback (most recent call last):
  File "pandas_qcut.py", line 29, in <module>
    print(add_quantiles(df, 'numbers'))
  File "pandas_qcut.py", line 20, in add_quantiles
    labels=labels, duplicates='drop'))], axis=1)
  File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 206, in qcut
    dtype=dtype, duplicates=duplicates)
  File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 252, in _bins_to_cuts
    raise ValueError('Bin labels must be one fewer than '
ValueError: Bin labels must be one fewer than the number of bin edges
Problem description
When this function is used with quantiles that produce repeated bin edges, it raises “ValueError: Bin labels must be one fewer than the number of bin edges”. With the optional “duplicates” parameter set to 'drop', the only way to pass a valid “labels” parameter is to check for duplicate bin edges beforehand, which means repeating the code that calculates the bins.
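The workaround mentioned above, working out the duplicate bin edges before calling pandas, can be sketched roughly as follows. This is only an illustration: qcut_drop_labels is a hypothetical helper, not part of pandas, and it assumes that each label names the bin ending at the corresponding quantile, so a label is dropped when its upper edge repeats the previous edge. The data are the six values from the printout above.

```python
import numpy as np
import pandas as pd


def qcut_drop_labels(x, q, labels):
    """Hypothetical helper (not a pandas API): quantile-cut a Series and
    drop the labels of bins whose edges were duplicated."""
    # An int means equally spaced quantiles, mirroring how qcut treats q.
    probs = np.linspace(0, 1, q + 1) if isinstance(q, int) else np.asarray(q, dtype=float)
    if len(labels) != len(probs) - 1:
        raise ValueError("need exactly one label per requested bin")
    edges = x.quantile(probs).to_numpy()
    # An edge survives if it is strictly greater than the previous one.
    keep = np.concatenate(([True], np.diff(edges) > 0))
    # Assumption: labels[i] names the bin ending at edges[i + 1];
    # keep it only if that upper edge survives.
    kept_labels = [label for label, k in zip(labels, keep[1:]) if k]
    return pd.cut(x, bins=edges[keep], labels=kept_labels, include_lowest=True)


numbers = pd.Series([0.0, 0.0, 0.0, 33.0, 81.0, 13.0])
labels = ["numbers_25q", "numbers_50q", "numbers_75q", "numbers_100q"]
binned = qcut_drop_labels(numbers, 4, labels)
print(binned.tolist())
```

For these values the quartile edges are 0, 0, 6.5, 28 and 81, so the sketch drops numbers_25q and assigns the remaining three labels, which is the grouping shown under Expected Output. It does not handle the case where all edges collapse into one.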
Expected Output
pd.qcut should return the quantized column with the labels corresponding to the indices of the unique bins. For example, the expected output of add_quantiles:
   numbers  numbers_50q  numbers_75q  numbers_100q
0      0.0            1            0             0
1      0.0            1            0             0
2      0.0            1            0             0
3     33.0            0            0             1
4     81.0            0            0             1
5     13.0            0            1             0
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
pandas: 0.22.0
pytest: 3.5.0
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Top GitHub Comments
Here is an even simpler example. It works with duplicates='drop' alone:
But if you try to apply labels, then it fails:
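As an illustration of both cases, here is a minimal sketch with made-up data and label names (assumptions, not the commenter's original snippet). With four requested quantiles, the repeated zeros collapse several bin edges, so fewer bins survive than labels were supplied.

```python
import pandas as pd

# Assumed sample data: enough repeated zeros that several quartile edges coincide.
s = pd.Series([0, 0, 0, 0, 1, 2, 3])

# Case 1: duplicates='drop' on its own succeeds, returning fewer than 4 bins.
binned = pd.qcut(s, q=4, duplicates="drop")
n_bins = len(binned.cat.categories)
print("bins after dropping duplicates:", n_bins)

# Case 2: adding one label per requested bin fails, because the label count
# no longer matches the number of surviving bins.
try:
    pd.qcut(s, q=4, labels=["a", "b", "c", "d"], duplicates="drop")
    error = None
except ValueError as exc:
    error = str(exc)
print("error with labels:", error)
```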
There is no way to know in advance how many bin edges Pandas is going to drop, or even which ones it has dropped after the fact, so it's pretty much impossible to use duplicates='drop' and labels together reliably.
This fix could be implemented as follows: the duplicates kwarg should retain raise and drop for backwards compatibility, but add merge_left and merge_right to specify whether the left-most or right-most bin's label should be retained. Example:
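Note that merge_left and merge_right are only proposed here and are not existing pandas options. As a rough sketch of the intended label handling, the hypothetical function below picks one label per surviving bin. Treating a zero-width bin as merged into the next non-empty bin (or into the previous one if it comes last) is an assumption of this sketch, not something the proposal spells out.

```python
def merge_labels(edges, labels, how="merge_right"):
    """Hypothetical sketch (not a pandas function): choose one label per
    bin that survives after duplicate edges are dropped."""
    if how not in ("merge_left", "merge_right"):
        raise ValueError("how must be 'merge_left' or 'merge_right'")
    if len(labels) != len(edges) - 1:
        raise ValueError("need one label per original bin")
    groups = []   # labels of the original bins folded into each surviving bin
    pending = []  # zero-width bins waiting for the next non-empty bin
    for label, lo, hi in zip(labels, edges[:-1], edges[1:]):
        pending.append(label)
        if hi > lo:
            groups.append(pending)
            pending = []
    if pending and groups:
        groups[-1].extend(pending)  # trailing zero-width bins join the last bin
    pick = 0 if how == "merge_left" else -1
    return [group[pick] for group in groups]


edges = [0.0, 0.0, 6.5, 28.0, 81.0]  # quartile edges of 0, 0, 0, 13, 33, 81
labels = ["q25", "q50", "q75", "q100"]
print(merge_labels(edges, labels, "merge_left"))   # ['q25', 'q75', 'q100']
print(merge_labels(edges, labels, "merge_right"))  # ['q50', 'q75', 'q100']
```

Under this reading, merge_right reproduces the labels the original report expects, while merge_left would keep the label of the first collapsed bin instead.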