pandas.cut: the 'include_lowest' argument isn't behaving as documented


Code Sample

import pandas as pd
import numpy as np

pd.cut(np.array([1, 7, 5, 4, 6, 3]), bins=[0, 3, 6, 8], include_lowest=True)

Problem description

Merely setting include_lowest=True changes the dtype of the intervals from int64 to float64, and the first interval is still not left-inclusive. Here is the incorrect output that you'll get:

[(-0.001, 3.0], (6.0, 8.0], (3.0, 6.0], (3.0, 6.0], (3.0, 6.0], (-0.001, 3.0]]
Categories (3, interval[float64]): [(-0.001, 3.0] < (3.0, 6.0] < (6.0, 8.0]]

Expected Output

[[0, 3], (6, 8], (3, 6], (3, 6], (3, 6], [0, 3]]
Categories (3, interval[int64]): [[0, 3] < (3, 6] < (6, 8]]

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.8.2
pip: 10.0.1
setuptools: 40.4.3
Cython: 0.28.5
numpy: 1.15.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.1.1
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
jotasi commented, Aug 27, 2021

As suggested in #42212 (@mroeschke suggested moving the discussion here instead), I would also love to have the option to include not only the left-most boundary when the intervals are left-open, but also the right-most boundary when the intervals are right-open (i.e. for right=False).

My preferred solution would be to change include_lowest to something like final_interval_closed, which is also functional for right-open intervals (i.e. if right=False is specified).

Maybe, if the solution to this issue is to adapt the function or the underlying IntervalIndex instead of updating the documentation, this could be kept in mind and implemented as well.
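To make the right=False case concrete, here is a small sketch (the input values are illustrative, borrowed from the examples in this thread). With left-closed bins, a value equal to the highest bin edge falls outside every interval:

```python
import numpy as np
import pandas as pd

values = np.array([1, 7, 5, 4, 6, 3, 8])
bins = [0, 3, 6, 8]

# right=False produces left-closed bins: [0, 3), [3, 6), [6, 8)
result = pd.cut(values, bins=bins, right=False)

# Each code is a position in result.categories; -1 marks a value that fell in no bin.
print(result.codes)
```

The last value, 8, gets code -1 (shown as NaN in the categorical), which is the mirror image of the lowest edge being dropped when right=True.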

1 reaction
gdex1 commented, Jul 12, 2019

I’ve looked more into how to resolve this, and it seems to be a restriction of IntervalIndex rather than a documentation error. IntervalIndex requires that all bins are closed on the same side, so the expected result, Categories (3, interval[int64]): [[0, 3] < (3, 6] < (6, 8]], is not valid.
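A minimal sketch of that restriction: an IntervalIndex carries a single closed setting that applies to every interval, whereas a standalone pd.Interval can be closed on both sides. The bin edges are the ones from this issue.

```python
import pandas as pd

# One `closed` value is shared by every interval in the index.
idx = pd.IntervalIndex.from_breaks([0, 3, 6, 8], closed="right")
print(idx.closed)    # right

# A standalone Interval can be closed on both sides ...
first = pd.Interval(0, 3, closed="both")
print(0 in first)    # True

# ... but the first interval of the right-closed index excludes its left edge.
print(0 in idx[0])   # False
```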

pandas works around this by slightly lowering the lower bound of the first interval while keeping that interval open on the left. Unfortunately, this converts the intervals from int64 to float64 and produces the following unexpected behavior:
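The -0.001 in the output is consistent with the lowest edge being shifted down by 10 ** (-precision), where precision=3 is the documented default of pd.cut. The arithmetic below is only an illustration of that shift, assumed from the observed output, not a copy of the pandas implementation:

```python
lowest_edge = 0   # integer bin edge from the example
precision = 3     # default value of pd.cut's `precision` parameter

# Subtracting a fractional amount from an int yields a float,
# which is why the interval edges stop being integers.
adjusted = lowest_edge - 10 ** (-precision)

print(adjusted)        # -0.001
print(type(adjusted))  # <class 'float'>
```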

In: 
pd.cut(np.array([-0.0001, 0, 1, 7, 5, 4, 6, 3, 8]), bins=[0, 3, 6, 8], include_lowest=True)

Out: 
[NaN, (-0.001, 3.0], (-0.001, 3.0], (6.0, 8.0], (3.0, 6.0], (3.0, 6.0], (3.0, 6.0], (-0.001, 3.0], (6.0, 8.0]]
Categories (3, interval[float64]): [(-0.001, 3.0] < (3.0, 6.0] < (6.0, 8.0]]

You can see that even though -0.0001 lies inside the displayed first interval, (-0.001, 3.0], it is assigned NaN instead of the first bin.

I am new to this, so I was wondering: what is the best way to handle this?
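One possible way to sidestep the display issue is to compute integer bin codes directly with NumPy and treat the lowest edge as a special case. This is only a sketch: bin_codes_include_lowest is a hypothetical helper written for this illustration, not a pandas function, and it returns codes rather than interval labels.

```python
import numpy as np

def bin_codes_include_lowest(values, bins):
    """Hypothetical helper: right-closed bin codes, with the first bin also closed on the left."""
    values = np.asarray(values)
    bins = np.asarray(bins)
    # side="left": a value v with bins[i-1] < v <= bins[i] is placed at index i, i.e. code i - 1.
    codes = np.searchsorted(bins, values, side="left") - 1
    # A value exactly equal to the lowest edge would get code -1; move it into bin 0.
    codes[values == bins[0]] = 0
    # Values below the lowest edge or above the highest edge belong to no bin.
    codes[(values < bins[0]) | (values > bins[-1])] = -1
    return codes

codes = bin_codes_include_lowest([-0.0001, 0, 1, 7, 5, 4, 6, 3, 8], [0, 3, 6, 8])
print(codes)  # [-1  0  0  2  1  1  1  0  2]
```

Here -0.0001 is reported as out of range (-1) and 0 lands in the first bin, which matches the documented meaning of include_lowest without touching the bin edges.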


