Feature: Qcut when passed labels and duplicates='drop' should drop corresponding labels

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
def add_quantiles(data, column, quantiles=4):
    """
    Returns the given dataframe with dummy columns for quantiles of a given column. Quantiles can be an int to
    specify equally spaced quantiles or a list of quantiles
    :param data: DataFrame :type data: DataFrame 
    :param column: column to which add quantiles :type column: string 
    :param quantiles: number of quantiles to generate or list of quantiles :type quantiles: Union[int, list of float] 
    :return: DataFrame 
    """
    if isinstance(quantiles, int):
        labels = [column + "_" + str(int(quantile / quantiles * 100)) + "q" for quantile in range(1, quantiles + 1)]
    if isinstance(quantiles, list):
        labels = [column + "_" + str(int(quantile * 100)) + "q" for quantile in quantiles]
        del labels[0]  # Bin labels must be one fewer than the number of bin edges
    data = pd.concat([data, pd.get_dummies(pd.qcut(x=data[column],
                                                   q=quantiles,
                                                   labels=labels, duplicates='drop'))], axis=1)
    return data

zs = np.zeros(3)
rs = np.random.randint(1, 100, size=3)
arr=np.concatenate((zs, rs))
ser = pd.Series(arr)
df = pd.DataFrame({'numbers':ser})
print(df)
#numbers
#0      0.0
#1      0.0
#2      0.0
#3     33.0
#4     81.0
#5     13.0
print(add_quantiles(df, 'numbers'))
Traceback (most recent call last):
  File "pandas_qcut.py", line 29, in <module>
    print(add_quantiles(df, 'numbers'))
  File "pandas_qcut.py", line 20, in add_quantiles
    labels=labels, duplicates='drop'))], axis=1)
  File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 206, in qcut
    dtype=dtype, duplicates=duplicates)
  File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 252, in _bins_to_cuts
    raise ValueError('Bin labels must be one fewer than '
ValueError: Bin labels must be one fewer than the number of bin edges

Problem description

When this function is used with quantiles that produce repeated bin edges, it raises "ValueError: Bin labels must be one fewer than the number of bin edges". With the optional "duplicates" parameter set to 'drop', the only way to pass a valid "labels" argument is to check for duplicate bins beforehand, which means repeating the code that calculates the bins.
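For reference, the "check for duplicate bins beforehand" workaround could look roughly like the sketch below. The helper name qcut_keep_labels is hypothetical and not part of pandas, and the sketch assumes that recomputing the edges with Series.quantile (default linear interpolation) matches what qcut computes internally. It keeps the label of every bin that still has a non-zero width and passes the de-duplicated edges to pd.cut.

```python
import numpy as np
import pandas as pd


def qcut_keep_labels(values, q, labels):
    """Hypothetical helper (not a pandas API): qcut that skips labels of empty bins."""
    values = pd.Series(values)
    # An int q means q equal quantile steps; otherwise q is a list of quantile levels.
    if isinstance(q, (int, np.integer)):
        levels = np.linspace(0, 1, q + 1)
    else:
        levels = np.asarray(q, dtype=float)
    if len(labels) != len(levels) - 1:
        raise ValueError("need exactly one label per requested bin")
    # Recompute the quantile edges ourselves (assumed to match qcut's edges).
    edges = values.quantile(levels).to_numpy()
    # Bin i spans edges[i]..edges[i + 1]; it collapses when both edges are equal.
    keep = np.diff(edges) > 0
    kept_labels = [label for label, flag in zip(labels, keep) if flag]
    # Cut on the de-duplicated edges; include_lowest puts the minimum in the first bin.
    return pd.cut(values, bins=np.unique(edges), labels=kept_labels, include_lowest=True)


df = pd.DataFrame({"numbers": [0.0, 0.0, 0.0, 33.0, 81.0, 13.0]})
labels = ["numbers_25q", "numbers_50q", "numbers_75q", "numbers_100q"]
binned = qcut_keep_labels(df["numbers"], 4, labels)
print(pd.concat([df, pd.get_dummies(binned)], axis=1))
```

This is exactly the duplicated bin calculation the issue complains about. The sketch does not handle the case where every edge is identical.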

Expected Output

pd.qcut should return the quantized column with the labels corresponding to the indices of the unique bins. E.g., the output of add_quantiles:

   numbers  numbers_50q  numbers_75q  numbers_100q
0      0.0            1            0             0
1      0.0            1            0             0
2      0.0            1            0             0
3     33.0            0            0             1
4     81.0            0            0             1
5     13.0            0            1             0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: 3.5.0
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created 5 years ago
  • Reactions: 5
  • Comments: 12 (4 by maintainers)

Top GitHub Comments

8 reactions
qris commented, Oct 4, 2018

Here is an even simpler example. Without the duplicates argument the call raises, and with duplicates='drop' alone it works:

>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True)
ValueError: Bin edges must be unique: array([ 97.,  97.,  97.,  99.]).
You can drop duplicate edges by setting the 'duplicates' kwarg

>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True, duplicates='drop')
([(96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0]]
Categories (1, interval[float64]): [(96.999, 99.0]], array([ 97.,  99.]))

But if you try to apply labels, then it fails:

>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True, duplicates='drop', labels=[1, 2, 3])
ValueError: Bin labels must be one fewer than the number of bin edges

There is no way to know in advance how many bin edges Pandas is going to drop, or even which ones it has dropped after the fact, so it’s pretty much impossible to use duplicates='drop' and labels together reliably.
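To illustrate the point, here is a small sketch, assuming current pandas behavior where labels=False skips the label-count check. Calling qcut with labels=False and retbins=True reports how many bins survived, but the returned edges alone do not say which of the original labels they correspond to.

```python
import pandas as pd

values = [ord(x) for x in "aaaaaabc"]

# labels=False returns integer bin codes, so no label count has to match.
codes, bins = pd.qcut(values, q=3, labels=False, retbins=True, duplicates="drop")

# Three bins were requested; the surviving edges reveal how many remain,
# but not which of the labels [1, 2, 3] each remaining bin should get.
n_bins = len(bins) - 1
print(n_bins, bins)
```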

4 reactions
rgerkin commented, Nov 9, 2021

This fix could be implemented as follows: the duplicates kwarg should retain raise and drop for backwards compatibility, but add merge_left and merge_right to specify whether the left-most or right-most bin's label should be retained.

Example:

import numpy as np
import pandas as pd
data = np.random.rand(100)
bin_edges = [0, 0.1, 0.1, 0.7, 1]
labels = ['a', 'b', 'c', 'd']

# Raises the unique-bins error, as it does currently
pd.cut(data, bin_edges, labels=labels, duplicates='raise')

# Raises the label-count error, as it does currently
pd.cut(data, bin_edges, labels=labels, duplicates='drop')

# Proposed: perform as pd.cut(data, [0, 0.1, 0.7, 1], labels=['a', 'b', 'd'])
pd.cut(data, bin_edges, labels=labels, duplicates='merge_left')

# Proposed: perform as pd.cut(data, [0, 0.1, 0.7, 1], labels=['a', 'c', 'd'])
pd.cut(data, bin_edges, labels=labels, duplicates='merge_right')
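Until something like this exists, the proposed semantics can be approximated outside pandas. The sketch below is only one reading of the proposal: merge_duplicate_edges is a hypothetical helper, and 'merge_left' / 'merge_right' are not values pandas accepts. It assumes each zero-width bin merges with the next non-empty bin to its right (or with the last bin if none follows), and that the merged bin keeps the left-most or right-most label of the group.

```python
import numpy as np
import pandas as pd


def merge_duplicate_edges(bin_edges, labels, how="merge_left"):
    """Hypothetical helper: 'merge_left'/'merge_right' are NOT pandas options."""
    if how not in ("merge_left", "merge_right"):
        raise ValueError("how must be 'merge_left' or 'merge_right'")
    edges = np.asarray(bin_edges, dtype=float)  # assumed sorted, non-decreasing
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per bin")
    groups = []   # each group holds the labels of bins that collapse together
    pending = []  # labels of zero-width bins waiting for the next real bin
    for label, width in zip(labels, np.diff(edges)):
        pending.append(label)
        if width > 0:
            groups.append(pending)
            pending = []
    if pending:  # trailing zero-width bins join the last real bin
        if not groups:
            raise ValueError("all bin edges are identical")
        groups[-1].extend(pending)
    pick = 0 if how == "merge_left" else -1
    return np.unique(edges), [group[pick] for group in groups]


data = np.random.rand(100)
edges, kept = merge_duplicate_edges([0, 0.1, 0.1, 0.7, 1], ["a", "b", "c", "d"], "merge_right")
binned = pd.cut(data, edges, labels=kept)
```

With the edges and labels from the example above, "merge_left" keeps a, b, d and "merge_right" keeps a, c, d, which matches the behavior described in the comment.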