Bug: groupby with sort=False creates buggy MultiIndex
import numpy as np
import pandas as pd

d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-13'])
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()
The index of t is certainly not sorted, but t.index.is_lexsorted() returns True.
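A quick way to confirm that the index really is unsorted, without relying on the lexsort flag, is to compare the index tuples against a sorted copy. This is only a sketch: the names tuples and values_sorted are illustrative and not part of the original report, and it avoids is_lexsorted because that method may not be available in every pandas version.

```python
import numpy as np
import pandas as pd

d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04',
                    '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-13'])
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()

# Compare the actual index values with a sorted copy of themselves.
# This looks only at the values, never at the MultiIndex codes.
tuples = t.index.tolist()
values_sorted = tuples == sorted(tuples)
print(values_sorted)
```

With sort=False the groups keep their order of first appearance, so the first entry (2020-11-02) precedes an earlier date and the value-based check reports the index as unsorted.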
A second, more subtle example:
d = [3,4,10,0,1,2,5,3]
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()
This time the lexsort flag is correct. However, sortlevel does not sort the new MultiIndex correctly: t.index.sortlevel(['d', 'a'])[0] returns
MultiIndex([( 3, 0),
            ( 3, 7),
            ( 4, 1),
            (10, 2),
            ( 0, 3),
            ( 1, 4),
            ( 2, 5),
            ( 5, 6)],
           names=['d', 'a'])
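One possible workaround, offered here as an assumption rather than something proposed in the original report, is to rebuild the MultiIndex from its tuples so that levels and codes are recomputed from the actual values before sorting. The names rebuilt, t_fixed and t_sorted are illustrative.

```python
import numpy as np
import pandas as pd

d = [3, 4, 10, 0, 1, 2, 5, 3]
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()

# Rebuild the index from plain tuples; from_tuples derives fresh levels
# and codes from the values instead of reusing those set by groupby.
rebuilt = pd.MultiIndex.from_tuples(t.index.tolist(), names=t.index.names)

t_fixed = t.copy()
t_fixed.index = rebuilt

# Sorting the rebuilt index orders the rows by value.
t_sorted = t_fixed.sort_index()
print(t_sorted.index.is_monotonic_increasing)
```

Rebuilding costs an extra pass over the index, so this is mainly useful as a check or a stopgap on affected versions.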
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-72-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.3.5
hypothesis : 5.5.4
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: 0.8.1
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.45.1
Issue Analytics
- Created 4 years ago
- Comments: 8 (6 by maintainers)
Top GitHub Comments
In this case, isn’t the current result correct though?
The index is sorted by its first argument, and so it is correct to say that it’s lexically sorted
(just some notes, will come back to this)
is_lexsorted assumes the order of .codes reflects the order of the elements in the index. But .codes is actually determined in pandas/core/groupby/grouper.py, where self.sort may be False.
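The mechanism described in this comment can be modeled without pandas. The sketch below is a simplified illustration, not the actual pandas implementation: factorize_in_order and codes_are_lexsorted are hypothetical helpers that assume codes are assigned in order of first appearance when sorting is disabled, and that the lexsort check looks only at those codes.

```python
# Simplified, pandas-free model of the behaviour described above.
# Both helpers are hypothetical and are not pandas functions.

def factorize_in_order(values):
    """Assign integer codes in order of first appearance (no sorting)."""
    positions = {}
    codes = []
    for v in values:
        # setdefault stores the next free code the first time v is seen
        codes.append(positions.setdefault(v, len(positions)))
    return codes, list(positions)


def codes_are_lexsorted(*code_columns):
    """Check lexicographic order of the code tuples only, ignoring values."""
    rows = list(zip(*code_columns))
    return all(x <= y for x, y in zip(rows, rows[1:]))


# First example from the report (ISO date strings compare chronologically).
d1 = ['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04',
      '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-13']
a1 = list(range(len(d1)))
d1_codes, _ = factorize_in_order(d1)
a1_codes, _ = factorize_in_order(a1)
flag1 = codes_are_lexsorted(d1_codes, a1_codes)
rows1 = list(zip(d1, a1))
really_sorted1 = rows1 == sorted(rows1)
print(flag1, really_sorted1)  # True False: the codes look sorted, the values are not

# Second example: sorting by codes reproduces the order reported for sortlevel.
d2 = [3, 4, 10, 0, 1, 2, 5, 3]
a2 = list(range(len(d2)))
d2_codes, _ = factorize_in_order(d2)
a2_codes, _ = factorize_in_order(a2)
flag2 = codes_are_lexsorted(d2_codes, a2_codes)
order = sorted(range(len(d2)), key=lambda i: (d2_codes[i], a2_codes[i]))
sorted_by_codes = [(d2[i], a2[i]) for i in order]
print(flag2)            # False
print(sorted_by_codes)  # [(3, 0), (3, 7), (4, 1), (10, 2), (0, 3), (1, 4), (2, 5), (5, 6)]
```

Under these assumptions, both symptoms from the report follow from the same cause: codes that reflect order of appearance rather than order of value.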