pandas.cut: the 'include_lowest' argument isn't behaving as documented
Code Sample
import pandas as pd
import numpy as np
pd.cut(np.array([1, 7, 5, 4, 6, 3]), bins=[0, 3, 6, 8], include_lowest=True)
Problem description
Just setting include_lowest to True changes the data type of the interval from int64 to float64, and the first interval isn’t left-inclusive. Here is the wrong output that you’ll get:
[(-0.001, 3.0], (6.0, 8.0], (3.0, 6.0], (3.0, 6.0], (3.0, 6.0], (-0.001, 3.0]]
Categories (3, interval[float64]): [(-0.001, 3.0] < (3.0, 6.0] < (6.0, 8.0]]
Expected Output
[(0, 3], (6, 8], (3, 6], (3, 6], (3, 6], (0, 3]]
Categories (3, interval[int64]): [[0, 3] < (3, 6] < (6, 8]]
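As a stopgap, the display can be made to match the expected output by passing explicit labels. This is only a sketch, not a fix from the pandas maintainers: the label strings below are hand-written assumptions, and the result holds plain string categories rather than Interval objects.

```python
import numpy as np
import pandas as pd

values = np.array([1, 7, 5, 4, 6, 3])
bins = [0, 3, 6, 8]

# Hypothetical hand-written labels. They only change what is displayed,
# not which bin each value is assigned to.
labels = ["[0, 3]", "(3, 6]", "(6, 8]"]

# include_lowest=True still makes the value 0 fall into the first bin.
result = pd.cut(values, bins=bins, include_lowest=True, labels=labels)
print(result)
```

The bin edges stay integers here because they are never converted into an IntervalIndex for labelling; the trade-off is that the categories can no longer be queried as intervals.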
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.8.2
pip: 10.0.1
setuptools: 40.4.3
Cython: 0.28.5
numpy: 1.15.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.1.1
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- State:
- Created 5 years ago
- Comments:7 (4 by maintainers)
Top GitHub Comments
As suggested in #42212 (@mroeschke suggested including it in the discussion here instead), I would also love to have the option to include not only the left-most boundary for left-open intervals but also the right-most boundary for right-open intervals (i.e. for right=False).

My preferred solution would be to change include_lowest to something like final_interval_closed, which would also work for right-open intervals (i.e. if right=False is specified).

Maybe, if the solution to this issue is to adapt the function or the underlying IntervalIndex instead of updating the documentation, this could be kept in mind and implemented as well.

I’ve looked more into how to resolve this, and it seems this is more a restriction of IntervalIndex than a documentation error. IntervalIndex requires that all bins are closed on the same side, so (3, interval[int64]): [[0, 3] < (3, 6] < (6, 8]] is not valid.

This is solved by slightly reducing the lower bound of the first interval and making it lower exclusive. Unfortunately, this converts the interval from int64 to float64 and creates the following unexpected behavior:
You can see that even though -0.0001 should be in the first interval, it is not assigned to the first bin.
I am new to this so I was wondering what the best way to handle this is?
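The mismatch between the displayed label and the actual bin assignment described in the comment above can be reproduced with a short sketch. The input values are chosen here for illustration and are not taken from the original comment, whose own example did not survive on this page.

```python
import pandas as pd

bins = [0, 3, 6, 8]

# -0.0001 lies just below the lowest bin edge; 0 and 3 lie on the edges
# of the first bin.
result = pd.cut([-0.0001, 0, 3], bins=bins, include_lowest=True)

first_bin = result.categories[0]
print(first_bin)               # label shown for the first bin
print(-0.0001 in first_bin)    # membership according to that label
print(result.isna())           # which inputs were left without a bin
```

On pandas versions that behave as described in this issue, the label of the first bin has a lower bound slightly below 0 and so appears to contain -0.0001, yet that value is left unassigned (NaN), while 0 and 3 are binned as expected.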