Bug: groupby with sort=False creates buggy MultiIndex

import numpy as np
import pandas as pd

d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-13'])
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()

The index of t is certainly not sorted, but t.index.is_lexsorted() returns True.
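
One way to confirm the mismatch without relying on the flag is to compare the index entries against a sorted copy of themselves. The sketch below is an illustration, not part of the original report: the fixed random seed and the names index_tuples and actually_sorted are assumptions added here. It only inspects the index values, so it should behave the same regardless of how the internal codes were built.

```python
import numpy as np
import pandas as pd

d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04',
                    '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-13'])
a = np.arange(len(d))
# Fixed seed is an assumption, used only to make the sketch reproducible.
b = np.random.default_rng(0).random(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()

# Compare the (d, a) tuples with their sorted order. This looks at the
# values themselves and does not depend on the index's internal codes.
index_tuples = list(t.index)
actually_sorted = index_tuples == sorted(index_tuples)
print(actually_sorted)
```

If actually_sorted disagrees with what the lexsort flag reports, the flag is the part that is wrong.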

Another, more subtle example is:

d = [3,4,10,0,1,2,5,3]
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()

This time the lexsort flag is correct. However, sortlevel does not sort the new MultiIndex correctly: t.index.sortlevel(['d', 'a'])[0] returns

MultiIndex([( 3, 0),
            ( 3, 7),
            ( 4, 1),
            (10, 2),
            ( 0, 3),
            ( 1, 4),
            ( 2, 5),
            ( 5, 6)],
           names=['d', 'a'])
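
A possible workaround, which is not taken from the issue thread, is to rebuild the MultiIndex from its tuples before sorting, so that levels and codes are recomputed from the values. This is a minimal sketch under the assumption that MultiIndex.from_tuples produces levels whose order matches the value order. The names rebuilt and sorted_index and the fixed random seed are hypothetical additions.

```python
import numpy as np
import pandas as pd

d = [3, 4, 10, 0, 1, 2, 5, 3]
a = np.arange(len(d))
# Fixed seed is an assumption, used only to make the sketch reproducible.
b = np.random.default_rng(0).random(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()

# Rebuild the index from its tuples so that levels and codes are
# recomputed instead of reusing the ones produced by groupby.
rebuilt = pd.MultiIndex.from_tuples(list(t.index), names=t.index.names)

# sortlevel returns a (sorted_index, indexer) pair; keep only the index.
sorted_index, _ = rebuilt.sortlevel(['d', 'a'])
print(sorted_index.tolist())
```

The indexer returned by sortlevel could also be passed to t.iloc to reorder the rows of t in the same way.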

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-72-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.15
pytest           : 5.3.5
hypothesis       : 5.5.4
sphinx           : 2.4.0
blosc            : None
feather          : None
xlsxwriter       : 1.2.7
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: 0.8.1
bs4              : 4.8.2
bottleneck       : 1.3.2
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.1.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : 0.13.0
pytables         : None
pytest           : 5.3.5
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.13
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : 1.2.7
numba            : 0.45.1

Issue Analytics

State:
Created 4 years ago
Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction

MarcoGorelli commented, Feb 27, 2020

If you do df.unstack('d').stack().index, the resulting index is unsorted on level 'd', but its .is_lexsorted() returns True.

In this case, isn’t the current result correct though?

In [1]: import pandas as pd; import numpy as np

In [2]: d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-14'])
   ...: a = np.arange(len(d))
   ...: b = np.random.rand(len(d))
   ...: df = pd.DataFrame({'d': d, 'a': a, 'b': b}).set_index(['d', 'a']).take([3,2,4,1,0,5,6,7])

In [3]: df.unstack('d').stack().index
Out[3]: 
MultiIndex([(0, '2020-11-02'),
            (1, '2019-01-02'),
            (2, '2020-01-02'),
            (3, '2020-02-04'),
            (4, '2020-11-03'),
            (5, '2019-11-03'),
            (6, '2019-11-13'),
            (7, '2019-11-14')],
           names=['a', 'd'])

The index is sorted by its first level ('a'), so it is correct to say that it is lexically sorted.

1 reaction

MarcoGorelli commented, Feb 26, 2020

(just some notes, will come back to this)

is_lexsorted assumes that the order of .codes reflects the sort order of the elements in the index. But .codes is actually determined by

                codes, uniques = algorithms.factorize(self.grouper, sort=self.sort)

in pandas/core/groupby/grouper.py, where self.sort may be False
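
The effect of the sort argument can be seen in isolation with the public pd.factorize function. This is a rough sketch that assumes pd.factorize behaves like the internal algorithms.factorize call quoted above for this input; the variable names are hypothetical. With sort=False, codes follow the order of first appearance, so increasing codes do not imply increasing values.

```python
import numpy as np
import pandas as pd

values = np.array([3, 4, 10, 0, 1, 2, 5, 3])

# sort=False: uniques are kept in order of first appearance,
# so code 0 stands for the value 3, code 1 for 4, and so on.
codes_unsorted, uniques_unsorted = pd.factorize(values, sort=False)

# sort=True: uniques are sorted, so comparing codes is equivalent
# to comparing the underlying values.
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)

print(codes_unsorted.tolist(), uniques_unsorted.tolist())
print(codes_sorted.tolist(), uniques_sorted.tolist())
```

Any check that only inspects the codes from the sort=False case would treat the first seven entries as ascending, even though the values they stand for are not.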
