BUG: Bins are unexpected for qcut when the edges are duplicated

Code Sample, a copy-pastable example if possible

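import pandas as pd
import numpy as np

# Ten values: three 0s, two each of 1, 2 and 3, and a single 4.
values = np.empty(shape=10)
values[:3] = 0
values[3:5] = 1
values[5:7] = 2
values[7:9] = 3
values[9:] = 4

# Ask for five quantile bins, dropping duplicated bin edges.
pd.qcut(values, 5, duplicates='drop')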

Problem description

The first bin contains both 0 and 1. Since I want roughly 20% of the data in each bin, I would expect the first bin to contain only the 0s (30% of the data) rather than the 0s and the 1s (50% of the data).
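To see where the 50% comes from, here is a minimal sketch that recomputes the quantile edges for the sample above. It assumes the edges come from linearly interpolated quantiles at six evenly spaced levels, and it uses np.unique as a stand-in for what duplicates='drop' does; it is an illustration, not the pandas implementation.

```python
import numpy as np

# Same data as the code sample: three 0s, two each of 1, 2, 3, one 4.
values = np.array([0, 0, 0, 1, 1, 2, 2, 3, 3, 4], dtype=float)

# Six evenly spaced quantile levels for five requested bins.
levels = np.linspace(0, 1, 6)
edges = np.quantile(values, levels)  # default linear interpolation

# Assumption: removing repeated edges approximates duplicates='drop'.
unique_edges = np.unique(edges)

# Share of the data at or below the upper edge of the first remaining bin.
first_bin_share = np.mean(values <= unique_edges[1])

print(edges)            # [0. 0. 1. 2. 3. 4.]
print(unique_edges)     # [0. 1. 2. 3. 4.]
print(first_bin_share)  # 0.5
```

The 0% and 20% quantiles are both 0, so once one of them is removed only four bins remain and the first one runs from 0 up to 1.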

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
LOCALE: None.None

pandas: 0.20.1
pytest: 2.8.5
pip: 8.1.1
setuptools: 21.2.1
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
xarray: None
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.5.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: 0.2.1

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

3 reactions
wyegelwel commented, Dec 1, 2017

I ran into this today. Consider the case:

In [3]: pd.qcut([1,1,1,1,2,3,4], 3, duplicates='drop')
Out[3]: 
[(0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (2.0, 4.0], (2.0, 4.0]]
Categories (2, interval[float64]): [(0.999, 2.0] < (2.0, 4.0]]

In [9]: pd.Series([1,1,1,1,2,3,4]).quantile([ 0.        ,  0.33333333,  0.66666667,  1.        ])
Out[9]: 
0.000000    1.0
0.333333    1.0
0.666667    2.0
1.000000    4.0

Given this data and these quantile values, I would expect the bins to be [(0.9999,1] < [2,4)]; however, they are [(0.999, 2.0] < (2.0, 4.0]].

I think this is a bug in the qcut logic with duplicates.

Specifically, when explicit quantiles are not given, qcut derives them with np.linspace(0, 1, num_quantiles + 1). The bucket ranges are then constructed from consecutive pairs of the resulting quantile values.

The problem is that if the minimum and the first quantile value are duplicates, then one of them is dropped, and the first quantile value ends up being treated as the minimum of the first bucket.
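A small pure-Python sketch of that behaviour, using the quantile values quoted above as the edges. The helper names (drop_duplicate_edges, assign_bins) are made up for illustration and are not pandas functions; the bins are assumed to be right-closed with the lowest edge included.

```python
from bisect import bisect_left

values = [1, 1, 1, 1, 2, 3, 4]
# Quantile values at 0, 1/3, 2/3 and 1, copied from the output quoted above.
edges = [1.0, 1.0, 2.0, 4.0]

def drop_duplicate_edges(edges):
    """Keep the first occurrence of each edge (edges are already sorted)."""
    unique = []
    for edge in edges:
        if not unique or edge != unique[-1]:
            unique.append(edge)
    return unique

def assign_bins(values, edges):
    """Right-closed bins (edges[i], edges[i+1]], with the lowest edge included."""
    return [max(bisect_left(edges, v) - 1, 0) for v in values]

unique_edges = drop_duplicate_edges(edges)
codes = assign_bins(values, unique_edges)

print(unique_edges)  # [1.0, 2.0, 4.0]
print(codes)         # [0, 0, 0, 0, 0, 1, 1]
```

With the repeated 1.0 removed, the first bucket stretches from the minimum up to 2.0, so the value 2 lands in the same bucket as the four 1s, which matches the qcut output shown earlier.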

I think the fix is, if the 0th and 1st bin values are equal, to update the 0th bin value by subtracting a small epsilon instead of filtering it out.
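For what it's worth, here is a rough sketch of what that suggestion could look like. nudge_lowest_edge and bin_codes are hypothetical helpers, the eps value is arbitrary, and none of this is taken from pandas. The second edge list is what linear interpolation gives for the data in the original report (computed here, not quoted from the issue).

```python
from bisect import bisect_left

def nudge_lowest_edge(edges, eps=1e-3):
    """Hypothetical helper: if the two lowest edges coincide, lower the first by eps."""
    edges = list(edges)
    if len(edges) > 1 and edges[0] == edges[1]:
        edges[0] -= eps
    return edges

def bin_codes(values, edges):
    """Index of the right-closed bin (edges[i], edges[i+1]] holding each value."""
    return [max(bisect_left(edges, v) - 1, 0) for v in values]

# Edges from the example in this comment and from the original report.
comment_edges = nudge_lowest_edge([1.0, 1.0, 2.0, 4.0])
report_edges = nudge_lowest_edge([0.0, 0.0, 1.0, 2.0, 3.0, 4.0])

comment_codes = bin_codes([1, 1, 1, 1, 2, 3, 4], comment_edges)
report_codes = bin_codes([0, 0, 0, 1, 1, 2, 2, 3, 3, 4], report_edges)

print(comment_codes)  # [0, 0, 0, 0, 1, 2, 2]
print(report_codes)   # [0, 0, 0, 1, 1, 2, 2, 3, 3, 4]
```

Under these assumptions the duplicated minimum gets a bucket of its own, and the original report's data keeps all five buckets with only the 0s in the first one. The sketch only handles a duplicate at the lowest edge; repeated edges further up would still need to be dropped.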

0 reactions
burk commented, May 19, 2022

I believe I’ve hit the same or a closely related issue. When there are not enough distinct values to create the requested bins, the output depends on how large the input array is. I would expect both of these to generate two bins:

>>> pd.qcut([-1]*7 + [1] * 2, 5, labels=False, duplicates="drop")
array([0, 0, 0, 0, 0, 0, 0, 1, 1])
>>> pd.qcut([-1]*70 + [1] * 20, 5, labels=False, duplicates="drop")
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0])
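One possible explanation for the size dependence, sketched in plain Python: with linear interpolation, the 80% quantile of the 9-element array falls between a -1 and a 1, while for the 90-element array it falls between two 1s. linear_quantile and unique_edges are illustrative helpers written for this sketch, assuming linearly interpolated quantiles and duplicate edges being removed; they are not pandas internals.

```python
import math

def linear_quantile(sorted_values, q):
    """Quantile by linear interpolation between the two nearest order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac

def unique_edges(values, n_bins):
    """Quantile edges for n_bins equal-probability bins, with duplicates removed."""
    ordered = sorted(values)
    edges = [linear_quantile(ordered, i / n_bins) for i in range(n_bins + 1)]
    return sorted(set(edges))

small = unique_edges([-1] * 7 + [1] * 2, 5)
large = unique_edges([-1] * 70 + [1] * 20, 5)

print(len(small) - 1)  # 2 bins: an interpolated edge sits between -1 and 1
print(len(large) - 1)  # 1 bin: every edge is either -1 or 1
```

In the small case the interpolated edge separates the two values; in the large case the only surviving edges are -1 and 1, so everything falls into a single bin.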