Feature: Qcut when passed labels and duplicates='drop' should drop corresponding labels

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
def add_quantiles(data, column, quantiles=4):
    """
    Returns the given DataFrame with dummy columns for quantiles of a given column. Quantiles can be an int to
    specify equally spaced quantiles or a list of quantiles.
    :param data: DataFrame
    :type data: DataFrame
    :param column: column for which to add quantiles
    :type column: string
    :param quantiles: number of quantiles to generate or list of quantiles
    :type quantiles: Union[int, list of float]
    :return: DataFrame
    """
    if isinstance(quantiles, int):
        labels = [column + "_" + str(int(quantile / quantiles * 100)) + "q" for quantile in range(1, quantiles + 1)]
    if isinstance(quantiles, list):
        labels = [column + "_" + str(int(quantile * 100)) + "q" for quantile in quantiles]
        del labels[0]  # Bin labels must be one fewer than the number of bin edges
    data = pd.concat([data, pd.get_dummies(pd.qcut(x=data[column],
                                                   q=quantiles,
                                                   labels=labels, duplicates='drop'))], axis=1)
    return data

zs = np.zeros(3)
rs = np.random.randint(1, 100, size=3)
arr=np.concatenate((zs, rs))
ser = pd.Series(arr)
df = pd.DataFrame({'numbers':ser})
print(df)
#numbers
#0      0.0
#1      0.0
#2      0.0
#3     33.0
#4     81.0
#5     13.0
print(add_quantiles(df, 'numbers'))
Traceback (most recent call last):
  File "pandas_qcut.py", line 29, in <module>
    print(add_quantiles(df, 'numbers'))
  File "pandas_qcut.py", line 20, in add_quantiles
    labels=labels, duplicates='drop'))], axis=1)
  File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 206, in qcut
    dtype=dtype, duplicates=duplicates)
  File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 252, in _bins_to_cuts
    raise ValueError('Bin labels must be one fewer than '
ValueError: Bin labels must be one fewer than the number of bin edges

Problem description

When this function is used with quantiles that produce repeated bin edges, it raises "ValueError: Bin labels must be one fewer than the number of bin edges". Even with the optional "duplicates" parameter set to 'drop', the only way to pass a valid "labels" argument is to check for duplicate bins beforehand, which means repeating the bin calculation that qcut already performs.
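The two-pass workaround described above can be sketched as follows. This is only an illustration, not pandas behaviour: `qcut_with_labels` is a hypothetical helper name, and it assumes the quantile edges come back sorted so that duplicate edges are adjacent. It computes the edges with `Series.quantile`, drops the label of every zero-width bin, and then calls `pd.cut` on the unique edges.

```python
import numpy as np
import pandas as pd


def qcut_with_labels(x, q, labels):
    """Quantile-cut x, dropping the labels of zero-width bins.

    Hypothetical helper (not part of pandas); a sketch of the manual workaround.
    """
    x = pd.Series(x)
    # Turn an integer q into explicit quantile points, as qcut does conceptually.
    quantiles = np.linspace(0, 1, q + 1) if isinstance(q, int) else np.asarray(q, dtype=float)
    if len(labels) != len(quantiles) - 1:
        raise ValueError("need exactly one label per requested bin")

    # First pass: compute the bin edges ourselves.
    edges = x.quantile(quantiles).to_numpy()
    # A bin whose left and right edges coincide is the one that gets dropped.
    keep = np.diff(edges) > 0
    kept_labels = [label for label, k in zip(labels, keep) if k]
    if not kept_labels:
        raise ValueError("all requested bins have zero width")

    # Second pass: cut on the unique edges with the surviving labels.
    return pd.cut(x, bins=np.unique(edges), labels=kept_labels, include_lowest=True)


# Usage with the data shape from the issue (three zeros plus three positive values).
numbers = [0, 0, 0, 33, 81, 13]
binned = qcut_with_labels(numbers, 4, ["numbers_25q", "numbers_50q", "numbers_75q", "numbers_100q"])
print(pd.get_dummies(binned))
```

The cost is exactly the one the issue complains about: the quantiles are computed by the caller, duplicating work that qcut does internally.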

Expected Output

pd.qcut should return the quantized column with the labels corresponding to the indices of the unique bins. For example, the output of add_quantiles would be:

   numbers  numbers_50q  numbers_75q  numbers_100q
0      0.0            1            0             0
1      0.0            1            0             0
2      0.0            1            0             0
3     33.0            0            0             1
4     81.0            0            0             1
5     13.0            0            1             0

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: 3.5.0
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

State:
Created 5 years ago
Reactions: 5
Comments: 12 (4 by maintainers)

Top GitHub Comments

8 reactions

qris commented, Oct 4, 2018

Here is an even simpler example. It works with duplicates='drop' alone:

>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True)
ValueError: Bin edges must be unique: array([ 97.,  97.,  97.,  99.]).
You can drop duplicate edges by setting the 'duplicates' kwarg

>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True, duplicates='drop')
([(96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0]]
Categories (1, interval[float64]): [(96.999, 99.0]], array([ 97.,  99.]))

But if you try to apply labels, then it fails:

>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True, duplicates='drop', labels=[1, 2, 3])
ValueError: Bin labels must be one fewer than the number of bin edges

There is no way to know in advance how many bin edges Pandas is going to drop, or even which ones it has dropped after the fact, so it’s pretty much impossible to use duplicates='drop' and labels together reliably.
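One way to at least inspect what happened, sketched below under the assumption that qcut derives its edges the same way `Series.quantile` does, is to compare the requested edges with the edges returned by `retbins=True`. Passing `labels=False` sidesteps the label-count check. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = pd.Series([ord(c) for c in "aaaaaabc"])
q = 3

# Edges that were requested, before any duplicates are dropped.
requested = values.quantile(np.linspace(0, 1, q + 1)).to_numpy()

# Edges that qcut actually kept; labels=False avoids the label-count check.
codes, retained = pd.qcut(values, q=q, labels=False, retbins=True, duplicates="drop")

n_dropped = len(requested) - len(retained)
print("requested edges:", requested)
print("retained edges: ", np.asarray(retained))
print("edges dropped:  ", n_dropped)
```

This still requires a second pass over the data, so it does not remove the need for qcut to handle the labels itself.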

4 reactions

rgerkin commented, Nov 9, 2021

This fix could be implemented as follows: the duplicates kwarg should retain raise and drop for backwards compatibility, but add merge_left and merge_right to specify whether the left-most or right-most bin's label should be retained.

Example:

import numpy as np
import pandas as pd
data = np.random.rand(100)
bin_edges = [0, 0.1, 0.1, 0.7, 1]
labels = ['a', 'b', 'c', 'd']

# Raise unique bins error as currently
pd.cut(data, bin_edges, labels=labels, duplicates='raise')

# Raise labels size error as currently
pd.cut(data, bin_edges, labels=labels, duplicates='drop')

# Proposed: perform as pd.cut(data, [0, 0.1, 0.7, 1], labels=['a', 'b', 'd'])
pd.cut(data, bin_edges, labels=labels, duplicates='merge_left')

# Proposed: perform as pd.cut(data, [0, 0.1, 0.7, 1], labels=['a', 'c', 'd'])
pd.cut(data, bin_edges, labels=labels, duplicates='merge_right')
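
Until something like this exists in pandas, the proposed behaviour could be emulated in user code. The sketch below is one possible reading of the proposal, not an implementation from the pandas project: `cut_merging_duplicates` and its `how` parameter are hypothetical names, the edges are assumed to be sorted, and a zero-width bin is merged with the next non-empty bin to its right.

```python
import numpy as np
import pandas as pd


def cut_merging_duplicates(data, bin_edges, labels, how="merge_right"):
    """Emulate the proposed duplicates='merge_left' / 'merge_right' options.

    Hypothetical helper; pandas itself only accepts 'raise' and 'drop'.
    """
    if how not in ("merge_left", "merge_right"):
        raise ValueError("how must be 'merge_left' or 'merge_right'")
    edges = np.asarray(bin_edges, dtype=float)
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per bin")

    kept = []
    pending = None  # label of the first zero-width bin in the current run
    for label, width in zip(labels, np.diff(edges)):
        if width == 0:
            if pending is None:
                pending = label
            continue
        # The zero-width run (if any) merges into this bin: keep either the
        # run's label (merge_left) or this bin's own label (merge_right).
        use_left = how == "merge_left" and pending is not None
        kept.append(pending if use_left else label)
        pending = None

    return pd.cut(data, bins=np.unique(edges), labels=kept)


sample = [0.05, 0.3, 0.9]
edges = [0, 0.1, 0.1, 0.7, 1]
print(list(cut_merging_duplicates(sample, edges, ["a", "b", "c", "d"], how="merge_left")))
print(list(cut_merging_duplicates(sample, edges, ["a", "b", "c", "d"], how="merge_right")))
```

A trailing zero-width bin has no right-hand neighbour in this sketch, so its label is simply discarded in both modes.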
